karastojko / mailio

mailio is a cross-platform C++ library for the MIME format and the SMTP, POP3 and IMAP protocols. It is based on standard C++17 and the Boost library.

GMail: parsing headers with value sometimes fails #89

Closed diegoiast closed 2 years ago

diegoiast commented 2 years ago

https://github.com/karastojko/mailio/blob/67f8d23b860b62d3ae656056926b0d95b2a6d1c2/src/mime.cpp#L741

I am getting emails which have content encoded in utf-7/cp1255, and which look like this: "Delivered-To: diegoiast@gmail.com".

The assert fires on the value, which is not UTF-8. Should we remove the call to codec::is_utf8_string()?

I also found that the same function fails with header_value = "=?windows-1255?Q?=F6=E5=E5=FA_=F9=EC_=F8=E1=F0=E9=ED_=EE=EE=FA=E9=EF?=\t=?windows-1255?Q?_=EC=EA_=F2=ED_=EB=EC_=E4=FA=F9=E5=E1=E5=FA_=EC=F9?=\t=?windows-1255?Q?=E0=EC=E5=FA_=F9=EE=E8=F8=E9=E3=E5=FA_=E0=... (encoded cp1255, as utf7 if I am not mistaken).

How about putting these asserts under _strict_mode?
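
One way to read the suggestion above is to gate the UTF-8 check behind a flag rather than asserting unconditionally. The following is a minimal sketch only: `looks_like_utf8` and `accept_header_value` are hypothetical names, not mailio's API, and the `strict_mode` parameter merely stands in for the library's strict-mode setting.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not mailio's codec::is_utf8_string()): returns true when
// `s` is structurally valid UTF-8. Rejects stray continuation bytes, truncated
// sequences, overlong forms, surrogates and code points above U+10FFFF.
bool looks_like_utf8(const std::string& s)
{
    // Smallest code point that legitimately needs a sequence of the given length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (c < 0x80) { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false; // continuation byte or invalid lead byte
        if (i + len > s.size()) return false; // truncated sequence
        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min_cp[len]) return false; // overlong encoding
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        i += len;
    }
    return true;
}

// Hypothetical policy function: in strict mode a non-UTF-8 value is an error,
// otherwise the raw bytes are passed through untouched for the caller to handle.
std::string accept_header_value(const std::string& value, bool strict_mode)
{
    if (strict_mode && !looks_like_utf8(value))
        throw std::invalid_argument("header value is not valid UTF-8");
    return value;
}
```

With this shape, raw cp1255 bytes such as `\xF6\xE5` are rejected only when `strict_mode` is true; in the lenient case they reach the caller unchanged.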

diegoiast commented 2 years ago

New data:

  1. It seems some clients send other 8-bit encodings, so the assumption that the email is Latin-1 is just not true. See https://github.com/karastojko/mailio/pull/91
  2. I see some emails without any senders. These come from spammers who send... invalid messages. I saw one with the sender set to the recipients "undisclosed recipients: ;", which kills the parser.
  3. I see as recipients this text, which again kills the parser: recipients ""\"Sales & Pre Sale / =?UTF-8?Q?=D7=9E=D7=9B=D7=99=D7=A8=D7=95=D7=AA=22=20?= =?UTF-8?Q?=3Cupgrades=40hostdime=2Eco=2Eil=3E?=\"@expansion.hostdime.com"
  4. I see quoted printable failed on "=D7=93=D7=99=D7=90=D7=92=D7=95_=D7=99=D7=A1=D7?==?utf-8?Q?=98=D7=A8=D7=95=D7=91=D7=A0=D7=99"
  5. parse_header_value_attributes() fails on this value: "application/octet-stream; name=217093469\\container_0_LOGO"
  6. parse_header_value() fails on this value: "attachment;filename*=utf-8''%D7%94%D7%A1%D7%9B%D7%9D%20%D7%A9%D7%9B%D7%99%D7%A8%D7%95%D7%AA%20%2D%20%D7%94%D7%95%D7%93%20%D7%94%D7%A9%D7%A8%D7%95%D7%9F.pdf"
  7. parse_header_line() fails on "Content-Type: text/plain; charset = \"UTF-8\"" I am unsure why :)
  8. parse_header_value(): fails on "video/x-ms-wmv; name=\"=?UTF-8?B?17PCs9aywrPXssKy1rLCs9ezwrPWssKz17LCstay?= =?UTF-8?B?wrPXs8Kz1rLCs9eywrLWssKzINezwrPWssKz17LCsg==?= =?UTF-8?B?1rLCsyDXs8Kz1rLCs9eywrLWssKz17PCs9aywrM=?= =?UTF-8?B?17LCs...
  9. I saw a message with a badly encoded base64 file, which again crashes the parser.
  10. parse_address_list() fails while reading this kind of email (probably a misconfigured client, again): aaa.bbb@gmail.com<info@aaa.net>
  11. parse_header_value_name() fails with this subject: "=?windows-1255?Q?=F6=E5=E5=FA_=F9=EC_=F8=E1=F0=E9=ED_=EE=EE=FA=E9=EF?=\t=?windows-1255?Q?_=EC=EA_=F2=ED_=EB=EC_=E4=FA=F9=E5=E1=E5=FA_=EC=F9?=\t=?windows-1255?Q?=E0=EC=E5=FA_=F9=EE=E8=F8=E9=E3=E5=FA_=E0=...

karastojko commented 2 years ago

Thanks a lot. I will try to address the issues in the coming period. I believe most of them violate the RFCs, so I'd probably handle them under the non-strict mode.

I will comment on the points here as I process them.

(2) Fix is for the non-strict mode. The quotes are excluded (recipients undisclosed recipients: ; would be extracted) since I am trying to keep up with the LL(1) parsing (although even that is not completely true).

(4) The sample looks incorrect because the Q encoding start and stop delimiters (=?utf-8 and ?=) are in the middle of the string. I believe it should be something like =?utf-8?Q?=D7=93=D7=99=D7=90=D7=92=D7=95_=D7=99=D7=A1?=, which decodes to דיאגו יס. Since this is Hebrew, which is written from right to left, maybe the email client reversed the order of the Q codec delimiters and on merge put them in the middle. Thus, no mailio fix can be applied here.

(5) Backslashes in the attribute value are not allowed by RFC 2045, section 5.1. The fix is for the non-strict mode.

(7) This is a violation of RFC 2045, section 5.1: whitespace around the equals sign is not allowed. Here is the fix for the non-strict mode.

(9) I need an example for this.

(10) RFC 5322, section 3.4.1 does not allow the address format in the name part (the @ sign is redundant), thus the fix goes to the non-strict mode.
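
The corrected encoded word from point (4) can be checked by hand. The sketch below decodes only the payload of a Q-encoded word (the text between `?Q?` and `?=`) into raw bytes. `decode_q_payload` is a hypothetical helper, not a mailio function, and it deliberately rejects a payload that still contains a `?` delimiter.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper: decodes the payload of an RFC 2047 "Q" encoded word into
// raw bytes. Returns std::nullopt on malformed input (bad hex escape, or a '?'
// or space inside the payload, which would mean the delimiters are misplaced).
std::optional<std::string> decode_q_payload(const std::string& payload)
{
    auto hex_value = [](char ch) -> int {
        if (ch >= '0' && ch <= '9') return ch - '0';
        if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
        if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
        return -1;
    };

    std::string out;
    for (std::size_t i = 0; i < payload.size(); ++i)
    {
        const char ch = payload[i];
        if (ch == '_')
        {
            out += ' '; // underscore stands for a space in Q encoding
        }
        else if (ch == '=')
        {
            if (i + 2 >= payload.size()) return std::nullopt; // truncated escape
            const int hi = hex_value(payload[i + 1]);
            const int lo = hex_value(payload[i + 2]);
            if (hi < 0 || lo < 0) return std::nullopt;
            out += static_cast<char>(hi * 16 + lo);
            i += 2;
        }
        else if (ch == '?' || ch == ' ')
        {
            return std::nullopt; // delimiter or raw space inside the payload
        }
        else
        {
            out += ch;
        }
    }
    return out;
}
```

For the payload `=D7=93=D7=99=D7=90=D7=92=D7=95_=D7=99=D7=A1` this yields the UTF-8 bytes of דיאגו יס, while the string reported in point 4 is rejected at the embedded `?`.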

diegoiast commented 2 years ago

Note how many of the failures relate to RFC 2047. Look at sections 2, 3 and 4 of that RFC, which define how to handle these kinds of non-Latin strings. How are you going to handle this?
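
For context, RFC 2047 lets a header carry several encoded words of the form `=?charset?encoding?payload?=`, and whitespace that only separates adjacent encoded words is ignored when the result is displayed. A rough sketch of the first step, locating the encoded words while leaving the payloads undecoded, might look like this. `encoded_word` and `find_encoded_words` are hypothetical names and not part of mailio, and converting the decoded bytes from a charset such as windows-1255 is left to a separate step.

```cpp
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Hypothetical record for one RFC 2047 encoded word, payload still encoded.
struct encoded_word
{
    std::string charset;  // e.g. "windows-1255" or "UTF-8"
    char encoding;        // 'Q' or 'B', normalized to upper case
    std::string payload;  // text between the third '?' and the closing "?="
};

// Hypothetical helper: collects every encoded word found in a header value.
// Text between the words (tabs, spaces, plain text) is simply skipped here.
std::vector<encoded_word> find_encoded_words(const std::string& header_value)
{
    static const std::regex pattern(R"(=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=)");
    std::vector<encoded_word> words;
    for (std::sregex_iterator it(header_value.begin(), header_value.end(), pattern), end;
         it != end; ++it)
    {
        const std::smatch& m = *it;
        const char enc = static_cast<char>(
            std::toupper(static_cast<unsigned char>(m[2].str()[0])));
        words.push_back({m[1].str(), enc, m[3].str()});
    }
    return words;
}
```

Applied to the tab-separated windows-1255 subject from the report, this returns one entry per encoded word, each tagged with its charset, so a later layer can decode and concatenate them.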

I would prefer to get the raw latin1/char string from this library, and that the library not even try to parse it. Same for IMAP: I would like to be able to get the raw `char`/`std::string` the remote server sends (so I can cache it verbatim locally) and split the parsing off into another layer (or a different sub-library).

But this is another issue altogether (relevant: https://www.youtube.com/watch?v=3qNtyfZP8bE).

diegoiast commented 2 years ago

Closing this issue; I will open a new issue about the encoding problems. Master is much more stable now.