rjbs / Email-MIME

perl library for parsing MIME messages
20 stars 30 forks source link

Fix the UTF8 issue in #18 #55

Closed dxdc closed 7 years ago

dxdc commented 7 years ago

This fixes the issue pointed out by @karel-m and @FGasper, which can be observed by running this code block multiple times:

use Email::MIME;
use Encode 'encode';
use utf8;

print Email::MIME->create(
    attributes => {
      disposition  => 'attachment',
      encoding     => 'base64',
      filename     => encode("MIME-Q", "kůň.pdf"),
      content_type => 'application/pdf',
      name         => encode("MIME-Q", "kůň.pdf"),
    },
    body => 'XXXX',
)->as_string;

As described in https://github.com/rjbs/Email-MIME/issues/18, it gives different results on subsequent runs, i.e.

... in the next run:
Content-Type: application/pdf; name="=?UTF-8?Q?k=C5=AF=C5=88=2Epdf?="
Content-Disposition: attachment; filename="kůň.pdf"

... in the next run (this one is IMO correct/expected):
Content-Type: application/pdf; name="=?UTF-8?Q?k=C5=AF=C5=88=2Epdf?="
Content-Disposition: attachment; filename="=?UTF-8?Q?k=C5=AF=C5=88=2Epdf?="

... in the next run:
Content-Type: application/pdf; name="kůň.pdf"
Content-Disposition: attachment; filename="=?UTF-8?Q?k=C5=AF=C5=88=2Epdf?="

This patch fixes it. It is caused by the fact that Perl hash keys are not ordered.

pali commented 7 years ago

Hi! Can you please describe in commit message (and also here on github) which UTF-8 issue it fixes and how? Because from this pull request it looks like it fixes some hash ordering to be deterministic and I have absolutely no idea how such thing is UTF-8 related.

Ideally with test example to show bug and that it was fixed.

And moreover this pull request does not fix issue #18 which say that "Email::MIME does not correctly escape the double quote." I do not see any implementation of escaping in this pull request. So probably it refer to another problem...

dxdc commented 7 years ago

Hi @pali ... have updated description above with test example :)

pali commented 7 years ago

... in the next run (this one is IMO correct/expected):

Content-Type: application/pdf; name="=?UTF-8?Q?k=C5=AF=C5=88=2Epdf?="
Content-Disposition: attachment; filename="=?UTF-8?Q?k=C5=AF=C5=88=2Epdf?="

As written in #18 this is incorrect too.

pali commented 7 years ago

Correct fix for #18 is in pull request #51

dxdc commented 7 years ago

@pali did you try the script? It gives alternating results as stated due to Perl hash order.

pali commented 7 years ago

I know that it gives different result based on perl hash order. But it does not fix #18 nor any other UTF-8 related issue.

I agree that such change to make behavior predicable is OK, but current description is irrelevant to this change.

dxdc commented 7 years ago

Sorry if there is confusion @pali. But, it does fix a UTF8 issue first mentioned in #18 as far as I know.

If I run this script repeatedly, I see the follow UTF8 errors in stderr which only happen sporadically. These are resolved after merging in this patch. I agree that the fix just takes advantage of the sort order of the keys, and there is probably a better/more permanent fix, but this does seem to work.

Unquoted '"' not allowed at test.pl line 5.
Missing semicolon before parameter '"kůň.pdf"' at test.pl line 5.
Wide character in print at test.pl line 5.
pali commented 7 years ago

Sorry if there is confusion @pali. But, it does fix a UTF8 issue first mentioned in #18 as far as I know.

Nope, it does not fix UTF-8 issue with Content-Type and Content-Disposition headers. As @rjbs (in https://github.com/rjbs/Email-MIME/issues/18#issuecomment-234929880) and also I (in https://github.com/rjbs/Email-MIME/pull/55#issuecomment-329805444) wrote, your following output is incorrect and wrong too.

Content-Type: application/pdf; name="=?UTF-8?Q?k=C5=AF=C5=88=2Epdf?="
Content-Disposition: attachment; filename="=?UTF-8?Q?k=C5=AF=C5=88=2Epdf?="

Correct output for filename kůň.pdf is:

Content-Type: application/pdf; name*=UTF-8''k%C5%AF%C5%88.pdf
Content-Disposition: attachment; filename*=UTF-8''k%C5%AF%C5%88.pdf
pali commented 7 years ago

Now I added above kůň.pdf test to PR #51 to explicitly verify that UTF-8 filename issue is fixed correctly. Hopefully other people would see correct output and would not try proposing something wrong again.

dxdc commented 7 years ago

@pali thank you for the clarification and adding the tests! I didn't understand before that the MIME-Q encoding wasn't correct for filename spec.

So, investigating this then, I still have an issue and suggestion.

  1. Suggestion: Reverse the order of filename and filename, so that filename comes after filename. This extends compatibility for clients that do not support filename*. See RFC6266.

  2. If I run the script below, the filename is not encoded into UTF8, as it does nicely in your tests. This is why the encode functions were used. Am I implementing this incorrectly?

print Email::MIME->create(
    attributes => {
      disposition  => 'attachment',
      encoding     => 'base64',
      filename     => "kůň.pdf",
      content_type => 'application/pdf',
      name         => "kůň.pdf",
    },
    body => 'XXXX',
)->as_string;

Output:

MIME-Version: 1.0
Content-Type: application/pdf; name="kůň.pdf"
Content-Disposition: attachment; filename="kůň.pdf"
Content-Transfer-Encoding: base64
pali commented 7 years ago
  1. hm... I have not find requirement of suggestion that filename* should be after filename. I think order is not important. Or is there any explicit statement that not? Also RFC6266 is for HTTP, not for emails.

  2. have you applied both patches? (one for Email-MIME and one for Email-MIME-ContentType)

dxdc commented 7 years ago
  1. I could only find this, from the link I sent, which suggests the order:

    Many user agent implementations predating this specification do not understand the "filename" parameter. Therefore, when both "filename" and "filename" are present in a single header field value, recipients SHOULD pick "filename" and ignore "filename". This way, senders can avoid special-casing specific user agents by sending both the more expressive "filename" parameter, and the "filename" parameter as fallback for legacy recipients (see Section 5 for an example).

...

Direct the UA to show "save as" dialog, with a filename containing the Unicode character U+20AC (EURO SIGN):

 Content-Disposition: attachment;
                      filename*= UTF-8''%e2%82%ac%20rates

Here, the encoding defined in [RFC5987] is also used to encode the non-ISO-8859-1 character.

This example is the same as the one above, but adding the "filename" parameter for compatibility with user agents not implementing RFC 5987:

 Content-Disposition: attachment;
                      filename="EURO rates";
                      filename*=utf-8''%e2%82%ac%20rates

Note: Those user agents that do not support the RFC 5987 encoding ignore "filename*" when it occurs after "filename".

I agree that most modern clients won't have an issue.

  1. Yes. Do you see a different result with that script on your system?
Email::MIME is up to date (1.946).
Email::MIME::ContentType is up to date (1.022).
pali commented 7 years ago
  1. In first cite part there is no information about order. In second cite part is example and I understood it as example in which both filename* and filename was present without any importance that it would not work when order is reversed... But in RFC2045 which is relevant for emails (RFC6266 is for http), there is a note in section https://tools.ietf.org/html/rfc2045#section-5

    After the media type and subtype names, the remainder of the header field is simply a set of parameters, specified in an attribute=value notation.  The ordering of parameters is not significant.
  2. Have you applied both patches from both pull requests?

    use utf8;                                                                                                                                                                                                                        
    use Email::MIME;                                                                                                                                                                                                                 
    print Email::MIME->create(                                                                                                                                                                                                       
    attributes => {                                                                                                                                                                                                              
      disposition  => 'attachment',                                                                                                                                                                                              
      encoding     => 'base64',                                                                                                                                                                                                  
      filename     => "kůň.pdf",                                                                                                                                                                                                 
      content_type => 'application/pdf',                                                                                                                                                                                         
      name         => "kůň.pdf",
    },
    body => 'XXXX',
    )->as_string;
    
    Date: Sun, 17 Sep 2017 08:35:42 +0200
    MIME-Version: 1.0
    Content-Type: application/pdf; name*=UTF-8''k%C5%AF%C5%88.pdf; name=kun.pdf
    Content-Disposition: attachment; filename*=UTF-8''k%C5%AF%C5%88.pdf;
    filename=kun.pdf
    Content-Transfer-Encoding: base64

WFhYWA==

dxdc commented 7 years ago

Thanks @pali working as expected now. Appreciate your explanation, will close this PR.