python / cpython

The Python programming language
https://www.python.org
Other
62.47k stars 29.99k forks source link

email.Message.as_string infinite loop #81945

Closed 9865f18d-5ae6-4271-be9f-b04f790afdfa closed 5 years ago

9865f18d-5ae6-4271-be9f-b04f790afdfa commented 5 years ago
BPO 37764
Nosy @warsaw, @bitdancer, @maxking, @miss-islington, @epicfaace
PRs
  • python/cpython#15239
  • python/cpython#15654
  • python/cpython#15686
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['type-security', '3.7', '3.8', 'expert-email', '3.9'] title = 'email.Message.as_string infinite loop' updated_at = user = 'https://bugs.python.org/mytran' ``` bugs.python.org fields: ```python activity = actor = 'epicfaace' assignee = 'none' closed = True closed_date = closer = 'maxking' components = ['email'] creation = creator = 'mytran' dependencies = [] files = [] hgrepos = [] issue_num = 37764 keywords = ['patch'] message_count = 16.0 messages = ['349055', '349196', '349199', '349200', '349202', '349277', '349358', '349505', '349847', '349848', '349849', '349962', '350919', '351089', '351160', '351177'] nosy_count = 6.0 nosy_names = ['barry', 'r.david.murray', 'maxking', 'miss-islington', 'epicfaace', 'mytran'] pr_nums = ['15239', '15654', '15686'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'security' url = 'https://bugs.python.org/issue37764' versions = ['Python 3.7', 'Python 3.8', 'Python 3.9'] ```

    9865f18d-5ae6-4271-be9f-b04f790afdfa commented 5 years ago

    The following will hang the system until it runs out of memory.

    import email
    import email.policy
    
    text = """From: user@host.com
    To: user@host.com
    Bad-Header:
     =?us-ascii?Q?LCSwrV11+IB0rSbSker+M9vWR7wEDSuGqmHD89Gt=ea0nJFSaiz4vX3XMJPT4vrE?=
     =?us-ascii?Q?xGUZeOnp0o22pLBB7CYLH74Js=wOlK6Tfru2U47qR?=
     =?us-ascii?Q?72OfyEY2p2=2FrA9xNFyvH+fBTCmazxwzF8nGkK6D?=

    Hello! """

    eml = email.message_from_string(text, policy=email.policy.SMTPUTF8)
    eml.as_string()
    ccbe4879-e54b-43e7-9c63-280a74d52875 commented 5 years ago

    I can't reproduce this on 3.9: https://github.com/epicfaace/cpython/runs/187997615

    ccbe4879-e54b-43e7-9c63-280a74d52875 commented 5 years ago

    I also can't reproduce this on 3.7: https://github.com/epicfaace/cpython/runs/188005822

    9865f18d-5ae6-4271-be9f-b04f790afdfa commented 5 years ago

    Reproduced on 3.7.4

    Looks like this started happening after this commit: https://github.com/python/cpython/commit/dc20fc4311dece19488299a7cd11317ffbe4d3c3#diff-19171ae20182f6759204a3436475ddd1

    9865f18d-5ae6-4271-be9f-b04f790afdfa commented 5 years ago

    I looked at the job at https://travis-ci.com/epicfaace/cpython/jobs/223345147 and its running py3.6.

    maxking commented 5 years ago

    This does look like a side-effect of the commit mentioned by mytran.

    The issues seems to be that email._header_value_parser.get_unstructured wrongfully assumes that anything leading with '=?' would be a valid rfc 2047 encoded word.

    This is a smaller test case to reproduce this bug:

    from email._header_value_parser import get_unstructured
    get_unstructured('=?utf-8?q?FSaiz4vX3XMJPT4vrExGUZeOnp0o22pLBB7CYLH74Js=3DwOlK6Tfru2U47qR?=72OfyEY2p2/rA9xNFyvH+fBTCmazxwzF8nGkK6D')
    maxking commented 5 years ago

    Adding security label since this can cause DOS.

    ccbe4879-e54b-43e7-9c63-280a74d52875 commented 5 years ago

    Oh, both the Travis links I sent actually ended up reproducing the bug.

    I've made a PR that fixes with an even smaller test case:

    get_unstructured('=?utf-8?q?somevalue?=aa')

    It looks like this is caused because "aa" is thought to be an encoded word escape in https://github.com/python/cpython/blob/fd5a82a7685d1599aab12e722a383cb0a2adfd8a/Lib/email/_header_value_parser.py#L1042 -- thus, get_encoded_word fails, which ends up making get_unstructured go in an infinite loop.

    My PR makes the parser parse "=?utf-8?q?somevalue?=aa" as "=?utf-8?q?somevalue?=aa". However, the existing test cases make sure it parses "=?utf-8?q?somevalue?=nowhitespace" as "somevaluenowhitespace". I'm not too familiar with RFC 2047, but why are "aa" and "nowhitespace" treated differently? Should they be?

    maxking commented 5 years ago

    You have correctly identified that "=aa" is detected as a encoded word and causes the get_encoded_word to fail.

    However, "=?utf-8?q?somevalue?=aa" should ideally get parsed as "somevalueaa" and not "=?utf-8?q?somevalue?=aa". This is because "=?utf-8?q?somevalue?=" is a valid encoded word, it is just not followed by an empty whitespace.

    modified   Lib/email/_header_value_parser.py
    @@ -1037,7 +1037,10 @@ def get_encoded_word(value):
             raise errors.HeaderParseError(
                 "expected encoded word but found {}".format(value))
         remstr = ''.join(remainder)
    -    if len(remstr) > 1 and remstr[0] in hexdigits and remstr[1] in hexdigits:
    +    if (len(remstr) > 1 and
    +        remstr[0] in hexdigits and
    +        remstr[1] in hexdigits and
    +        tok.count('?') < 2):
             # The ? after the CTE was followed by an encoded word escape (=XX).
             rest, *remainder = remstr.split('?=', 1)

    This can be avoided by checking ? occurs twice in the tok.

    The 2nd bug, which needs a better test case, is that if the encoded_word is invalid, you will keep running into infinite loop, which you correctly fixed in your PR. However, the test case you used is more appropriate for the first issue.

    You can fix both the issues, for which, you need to add a test case for 2nd issue and fix for the first issue.

    Looking into the PR now.

    maxking commented 5 years ago

    I meant, =aa is identified as encoded word escape

    maxking commented 5 years ago

    Although, the 2nd bug I spoke of is kind of speculative, I haven't been able to find a test case which matches rfc2047_matcher but raises exception with get_encoded_word (after, ofcourse, the first bug is fixed), which the only way to cause an infinite loop.

    ccbe4879-e54b-43e7-9c63-280a74d52875 commented 5 years ago

    Thanks, I've fixed the first case as you suggested.

    I found an example of the 2nd case -- '=?utf-8?q?=somevalue?=' -- which causes the method to hang. I've added a fix, though I'm not sure if it treats the string properly -- it parses it as '=?utf-8?q?=somevalue?=' and doesn't raise any defects. Is that the behavior we would want?

    miss-islington commented 5 years ago

    New changeset c5b242f87f31286ad38991bc3868cf4cfbf2b681 by Miss Islington (bot) (Ashwin Ramaswami) in branch 'master': bpo-37764: Fix infinite loop when parsing unstructured email headers. (GH-15239) https://github.com/python/cpython/commit/c5b242f87f31286ad38991bc3868cf4cfbf2b681

    miss-islington commented 5 years ago

    New changeset ea21389dda401457198fb214aa2c981a45ed9528 by Miss Islington (bot) (Ashwin Ramaswami) in branch '3.7': [3.7] bpo-37764: Fix infinite loop when parsing unstructured email headers. (GH-15239) (GH-15654) https://github.com/python/cpython/commit/ea21389dda401457198fb214aa2c981a45ed9528

    maxking commented 5 years ago

    New changeset 6ad0a2c45f78020f7994e47620c1cf7b225f8197 by Abhilash Raj in branch '3.8': [3.8] bpo-37764: Fix infinite loop when parsing unstructured email headers. (GH-15239) (GH-15686) https://github.com/python/cpython/commit/6ad0a2c45f78020f7994e47620c1cf7b225f8197

    ccbe4879-e54b-43e7-9c63-280a74d52875 commented 5 years ago

    Should we get a CVE for this because this is a security issue?