python / cpython

The Python programming language
https://www.python.org
Other
63.42k stars 30.37k forks source link

email.header uses re.IGNORECASE without re.ASCII #75858

Closed methane closed 7 years ago

methane commented 7 years ago
BPO 31677
Nosy @warsaw, @ezio-melotti, @methane, @serhiy-storchaka
PRs
  • python/cpython#3868
  • python/cpython#7856
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['expert-regex', '3.7'] title = 'email.header uses re.IGNORECASE without re.ASCII' updated_at = user = 'https://github.com/methane' ``` bugs.python.org fields: ```python activity = actor = 'hloeung' assignee = 'none' closed = True closed_date = closer = 'methane' components = ['Regular Expressions'] creation = creator = 'methane' dependencies = [] files = [] hgrepos = [] issue_num = 31677 keywords = ['patch'] message_count = 6.0 messages = ['303612', '303613', '303615', '303668', '303669', '303695'] nosy_count = 5.0 nosy_names = ['barry', 'ezio.melotti', 'mrabarnett', 'methane', 'serhiy.storchaka'] pr_nums = ['3868', '7856'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = None url = 'https://bugs.python.org/issue31677' versions = ['Python 3.7'] ```

    methane commented 7 years ago

    email.header has this pattern:

    https://github.com/python/cpython/blob/85c0b8941f0c8ef3ed787c9d504712c6ad3eb5d3/Lib/email/header.py#L34-L43

    # Match encoded-word strings in the form =?charset?q?Hello_World?=                       
    ecre = re.compile(r'''                                                                   
      =\?                   # literal =?                                                     
      (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset                     
      \?                    # literal ?                                                      
      (?P<encoding>[qb])    # either a "q" or a "b", case insensitive                        
      \?                    # literal ?                                                      
      (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string             
      \?=                   # literal ?=                                                     
      ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)

    Since only 's' and 'i' has other lower case character, this is not a real bug. But using re.ASCII is more safe.

    Additionally, email.util has same pattern from 10 years ago, and it is not used by anywhere. It should be removed.

    serhiy-storchaka commented 7 years ago

    Alternatively, re.IGNORECASE can be removed and [qb] replaced with [QqBb].

    warsaw commented 7 years ago

    I think using re.ASCII is a good addition since RFC 2047 says:

    Generally, an "encoded-word" is a sequence of printable ASCII characters that begins with "=?", ends with "?=", and has two "?"s in between. It specifies a character set and an encoding method, and also includes the original text encoded as graphic ASCII characters, according to the rules for that encoding method.

    It's better to keep the re.IGNORECASE since the RFC also says:

    Both 'encoding' and 'charset' names are case-independent. Thus the charset name "ISO-8859-1" is equivalent to "iso-8859-1", and the encoding named "Q" may be spelled either "Q" or "q".

    methane commented 7 years ago

    New changeset bf477a99e0c85258e6573f4ee9eda68fa1f98a31 by INADA Naoki in branch 'master': bpo-31677: email: Remove re.IGNORECASE flag (GH-3868) https://github.com/python/cpython/commit/bf477a99e0c85258e6573f4ee9eda68fa1f98a31

    methane commented 7 years ago

    It's better to keep the re.IGNORECASE since the RFC also says:

    Both 'encoding' and 'charset' names are case-independent. Thus the charset name "ISO-8859-1" is equivalent to "iso-8859-1", and the encoding named "Q" may be spelled either "Q" or "q".

    I'm sorry, I've committed before reading this. But I think it's not problem, because re.IGNORECASE doesn't affect to "(?P\<charset>[^?]*?)" pattern.

    warsaw commented 7 years ago

    On Oct 3, 2017, at 23:51, INADA Naoki \report@bugs.python.org\ wrote:

    > It's better to keep the re.IGNORECASE since the RFC also says: > > Both 'encoding' and 'charset' names are case-independent. Thus the > charset name "ISO-8859-1" is equivalent to "iso-8859-1", and the > encoding named "Q" may be spelled either "Q" or "q".

    I'm sorry, I've committed before reading this. But I think it's not problem, because re.IGNORECASE doesn't affect to "(?P\<charset>[^?]*?)" pattern.

    I think you’re change is fine, no need to revert or modify it.