python / cpython

The Python programming language
https://www.python.org
Other
63.42k stars 30.37k forks source link

email: get_boundary (and get_filename) invokes unquote twice #73131

Open 44b4c87c-fe02-447d-bbdd-58861e3ff3f2 opened 7 years ago

44b4c87c-fe02-447d-bbdd-58861e3ff3f2 commented 7 years ago
BPO 28945
Nosy @warsaw, @bitdancer, @elafontaine
Files
  • issue_28945.patch: utils.collapse_rfc2231_value doesn't unquote on this one
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['type-bug', 'expert-email'] title = 'email: get_boundary (and get_filename) invokes unquote twice' updated_at = user = 'https://bugs.python.org/bpoaugust' ``` bugs.python.org fields: ```python activity = actor = 'vstinner' assignee = 'none' closed = False closed_date = None closer = None components = ['email'] creation = creator = 'bpoaugust' dependencies = [] files = ['46003'] hgrepos = [] issue_num = 28945 keywords = ['patch'] message_count = 15.0 messages = ['282991', '283651', '283656', '283678', '283680', '283693', '283702', '283703', '283704', '283850', '283851', '283852', '283853', '284294', '284502'] nosy_count = 4.0 nosy_names = ['barry', 'r.david.murray', 'bpoaugust', 'Eric Lafontaine'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue28945' versions = ['Python 3.4', 'Python 3.5'] ```

    44b4c87c-fe02-447d-bbdd-58861e3ff3f2 commented 7 years ago

    get_boundary calls get_param('boundary') which unquotes the value. It then calls utils.collapse_rfc2231_value which also calls unquote.

    This causes problems for boundaries that have two sets of quotes. For example, I have seen the following in the wild:

    Content-Type: multipart/mixed; boundary="\<\<001-3e1dcd5a-119e>>"

    Both "" and \<> are treated as quotes by unquote.

    The result is that both "\< and >" are stripped off. This means that the boundaries no longer match.

    703682f5-5f69-4dfc-a098-9a70ca9625fe commented 7 years ago

    Hi all,

    I believe this is the right behavior and what ever generated the boundary "\<\<>>" is the problem ;

    RFC 2046 page 22:


    The only mandatory global parameter for the "multipart" media type is the boundary parameter, which consists of 1 to 70 characters from a set of characters known to be very robust through mail gateways, and NOT ending with white space. (If a boundary delimiter line appears to end with white space, the white space must be presumed to have been added by a gateway, and must be deleted.) It is formally specified by the following BNF:

     boundary := 0*69<bchars> bcharsnospace
    
     bchars := bcharsnospace / " "
    
     bcharsnospace := DIGIT / ALPHA / "'" / "(" / ")" /
                      "+" / "\_" / "," / "-" / "." /
                      "/" / ":" / "=" / "?"

    In other words, the only valid boundaries characters are : 01234567890 abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'()+_,-./:=?

    Any other character should be removed to get the boundary right. I believe the issue is that it wasn't removed in the first place. It is a bug in my opinion, but the other way around :).

    Funny thing is that the unquote function only remove the first&last character it sees... either '\<' and the '"'...

    def unquote(str):
        """Remove quotes from a string."""
        if len(str) > 1:
            if str.startswith('"') and str.endswith('"'):
                return str[1:-1].replace('\\\\', '\\').replace('\\"', '"')
            if str.startswith('<') and str.endswith('>'):
                return str[1:-1]
        return str

    Now, if I modify unquote to only keep the list of character above, would I break something? Probably. I haven't found any other defining RFC about boundaries that tells me what was the format supported. Can someone help me on that?

    This is what the function should look like :

    import string
    def get_boundary(str):
        """ return the valid boundary parameter as per RFC 2046 page 22. """
        if len(str) > 1:
            import re
            return re.sub('[^'+
                          string.ascii_letters +
                          string.digits +
                          r""" '()+_,-./:=?]|="""
                          ,'',str
                          ).rstrip(' ')
        return str
    
    import unittest
    
    class boundary_tester(unittest.TestCase):
        def test_get_boundary(self):
            boundary1 = """ abc def gh< 123 >!@ %!%' """
            ref_boundary1 = """ abc def gh 123  '""" # this is the valid Boundary
            ret_value = get_boundary(boundary1)
            self.assertEqual(ret_value,ref_boundary1)
    
        def test_get_boundary2(self):
            boundary1 = ''.join((' ',string.printable))
            ref_boundary1 = ' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'()+,-./:?_' # this is the valid Boundary
            ret_value = get_boundary(boundary1)
            self.assertEqual(ret_value,ref_boundary1)

    I believe this should be added to the email.message.Message get_boundary function.

    Regards, Eric Lafontaine

    44b4c87c-fe02-447d-bbdd-58861e3ff3f2 commented 7 years ago

    I agree that strictly speaking the boundary is invalid. However: 'Be strict in what you generate, be liberal in what you accept' The mail package should never create such boundaries. However it should process them if possible.

    If the boundary definition is mangled by stripping out all the invalid characters, then it won't match the markers. So it won't solve the issue.

    Whereas ensuring that only a single level of quotes is removed does fix the issue.

    This is what works for me:

    def get_boundary(self, failobj=None):
        missing = object()
        # get_param removes the outer quotes
        boundary = self.get_param('boundary', missing)
        if boundary is missing:
            return failobj
        # Don't unquote again in collapse_rfc2231_value
        if not isinstance(boundary, tuple) or len(boundary) != 3:
            return boundary
        # RFC 2046 says that boundaries may begin but not end in w/s
        return utils.collapse_rfc2231_value(boundary).rstrip()

    I think the bug is actually in collapse_rfc2231_value - that should not do any unquoting, as the value will be passed to it already unquoted (at least in this case). However there might perhaps be some cases where collapse_rfc2231_value is called with a quoted value, so to fix the immediate problem it's safer to fix get_boundary. (I could have re-quoted the value instead, and let collapse_rfc2231_value do its thing.)

    unquote is correct as it stands - it should only remove the outer quotes. There may be a need to quote strings that just happen to be enclosed in quote chars.

    44b4c87c-fe02-447d-bbdd-58861e3ff3f2 commented 7 years ago

    It looks like a simpler alternative is to just change

    boundary = self.get_param('boundary', missing)
    to
    boundary = self.get_param('boundary', missing, unquote=False)

    and let collapse_rfc2231_value do the unquoting

    44b4c87c-fe02-447d-bbdd-58861e3ff3f2 commented 7 years ago

    According to RFC822, a quoted-string should only be wrapped in double-quotes.

    So I'm not sure why unquote treats \<> as quotes. If it did not, then again this issue would not arise.

    However maybe utils.unquote is needed by other code that uses \<>, so it cannot just be updated without further analysis.

    bitdancer commented 7 years ago

    I'm pretty sure you are correct in your guess that the utility is used by other code that depends on the \<> unquoting.

    The new email policies (that is, the new header parsing code) probably do a better job of this, but get_param and friends haven't been wired up to use them when available yet.

    703682f5-5f69-4dfc-a098-9a70ca9625fe commented 7 years ago

    Hi all,

    I hate this proposition. I feel it's a "victory" for the people who don't want to follow RFC standard and allow "triple"-quoting on things that aren't supposed to...

    Now that my opinion is said, I made 2 test case that should be added to the test_email file of the test library to support the change :


    def test_rfc2231_multiple_quote_boundary(self):
        m = '''\\

    Content-Type: multipart/alternative; \tboundary*0="\<\<This%20is%20even%20more%20"; \tboundary*1="%2A%2A%2Afun%2A%2A%2A%20"; \tboundary*2="is it not.pdf>>"

    ''' msg = email.message_from_string(m) self.assertEqual(msg.get_boundary(), '\<\<This is even more ***fun*** is it not.pdf>>') def test_rfc2231_multiple_quote_boundary2(self): m = '''\ Content-Type: multipart/alternative; \tboundary="\<\<This is even more >>";

    ''' msg = email.message_from_string(m) self.assertEqual(msg.get_boundary(), '\<\<This is even more >>')


    The problem however does lie within collapse function as you've mentionned (because you've taken the code from there haven't you? :P) : def collapse_rfc2231_value(value, errors='replace', fallback_charset='us-ascii'): if not isinstance(value, tuple) or len(value) != 3: return value \<=======removed unquote on value

    This doesn't seem to have broken any of the 1580 tests case of test_email (Python3.7), but I can't help but to feel weird about it. Could someone explain why we were unquoting there? and why use those weird condition (not isintance(value,tuple) or len(value) !=3)....?

    Regards, Eric Lafontaine

    703682f5-5f69-4dfc-a098-9a70ca9625fe commented 7 years ago

    N.B., I've tried to do the keyword parameter on the get_param inside the get_boundary as well as I though it was a good Idea as well, but it was breaking some test cases (multiple ones). So I didn't pursue.

    N.B. I'm guessing (guessing) unquote must've been used for email adresses as well i.e. "Eric Lafontaine \eric.lafontaine1@gmail.com\"

    I also didn't had the time to read all answer since yesterday, sorry for not adding it to my earlier answer.

    bitdancer commented 7 years ago

    The RFC compliance battle in email was lost long ago. We have to do our best to process what we get, and to produce only RFC correct output (unless retransmitting without change). That is Postel's law, and you can't successfully process Internet email without following the first part.

    I haven't looked at the proposed solution yet. Not sure when I'll get time, since I should try to think and research the questions (why was it done the way it is currently done?), which will take quite a bit of time.

    703682f5-5f69-4dfc-a098-9a70ca9625fe commented 7 years ago

    Hi,

    I would like to have a bigger set of user testing this patch before it gets approve... I really don't know all the extent to which it's a good/bad idea.

    The proposition was to not do the unquote in the rfc2231 collapse method. It doesn't break anything on my machine on all the email tests... but I can't feel safe.

    I also added the 2 test case for future support (multiline boundary and single line boundary). It's basically the code I pasted on my answer on the 2016-12-20.

    Let me know what you think. Is it there we should stop doing it?

    Regards, Eric Lafontaine

    703682f5-5f69-4dfc-a098-9a70ca9625fe commented 7 years ago

    Hi,

    I would like to thank you for keeping up with me. I may not be easy at times, but please make me understand if it doesn't make sense :).

    Thanks bpoaugust and David, Eric Lafontaine

    44b4c87c-fe02-447d-bbdd-58861e3ff3f2 commented 7 years ago

    I have just discovered the same problem with get_filename. Not surprising as its code is basically the same as get_boundary.

    Unix paths can contain anything, so it's not correct to remove special characters. [It's up to the receiving file system to decide how to deal with chars that are not valid for it; the original name must be passed unchanged]

    If the quoting/unquoting is fixed for filenames, then it should be OK for the boundary as well.

    I think collapse_rfc2231_value should assume that any unquoting has already been done, and should therefore not call utils.unquote at all.

    The get_param() method by default unquotes both single strings and encoded triplets, so it's certainly the case that get_boundary and get_filename will pass an unquoted value to rfc2231, as will any other caller that calls get_param with the default parameters.

    44b4c87c-fe02-447d-bbdd-58861e3ff3f2 commented 7 years ago

    Note: it's easy to create test e-mails with attachments using mutt.

    echo test | mutt -s "test" -a files... -- user@domain

    I did some testing with the following names:

    \<>

    \< "" "\<abc>" \<"abc"> abc\< ">abc\<" "abc"\<

    There are no doubt other awkward names one can devise; in particular it would be useful to generate similar names with non-ASCII characters in them to force the use of value triples.

    44b4c87c-fe02-447d-bbdd-58861e3ff3f2 commented 7 years ago

    This is actually a bug in collapse_rfc2231_value: bpo-29020

    44b4c87c-fe02-447d-bbdd-58861e3ff3f2 commented 7 years ago

    The patch is incomplete.

    There is another unquote call:

    336 return unquote(text)