rjbs / Email-MIME

perl library for parsing MIME messages
20 stars 30 forks source link

Email address quotes are not preserved when phrase is not defined #52

Open atoomic opened 7 years ago

atoomic commented 7 years ago

Notice after updating Email::MIME from 1.940 to version 1.945 The email address '<>' quotes are not preserved if no phrase/name is set

one liner to show the issue with 1.945

$>perl526 -MEmail::MIME::Header -MEmail::MIME::Encode -E 'say Email::MIME::Encode::maybe_mime_encode_header("To", "<iwrestled\@abear.once>", "utf-8")'
    iwrestled@abear.once

$>perl526 -MEmail::MIME::Encode -E 'say Email::MIME::Encode::maybe_mime_encode_header("To", "<iwrestled\@abear.once>", "utf-8")'                      
    <iwrestled@abear.once>

After investigation it bisects to the commit ed7c2907ecca5ced43c4a50be6e3b4a74edfad7a which introduces the usage of Email::Address::XS instead of Email::Address. But Email::Address is not the source of the problem and the XS one is behaving the same way than the non XS one and preserve the brackets.

> perl -e 'use Email::Address; map { print $_->format . qq[\n] } Email::Address->parse( q[<from@mydomain.com>] );'
from@mydomain.com
> perl -e 'use Email::Address::XS; map { print $_->format . qq[\n] } Email::Address::XS->parse( q[<from@mydomain.com>] );'
from@mydomain.com

The problem comes from the shortcut in maybe_mime_encode_header which only occurs if Email::MIME::Header is loaded...

atoomic commented 7 years ago

commit 9107810 is adding a unit test to show the issue whereas 7986a6c is an attempt to add a naive fix

pali commented 7 years ago

Apparently this is more complicated problem. Function maybe_mime_encode_header is for encoding Unicode header to 8bit MIME. And for To header with value <iwrestled@abear.once>, correct MIME encoding is also iwrestled@abear.once.

Due to how MIME is working, for structures headers like addresses, it is first needed to fully decompose address-like header into internal structures, then every Unicode string needs to be encoded into 8bit MIME and after that header value is composed back to one structured string.

And it is better to guarantee RFC-compliant output which represent same input, rather then trying to hack around some special cases to not touch header, like in your case.

Your current bug report is just about different representation of encoded MIME header and both values have same meaning. There are plenty equivalent representation of your example (Unicode) input in MIME.

pali commented 7 years ago

So question is, why you want to hack maybe_mime_encode_header? If you already prepared correctly formatted 8bit MIME header, why not rather use header_raw_set function?

pali commented 7 years ago

Anyway, Email::MIME::Encode should load Email::MIME::Header as it uses variable $Email::MIME::Header::header_to_class_map.

atoomic commented 7 years ago

Very good question, in the existing code I maintain, Email::MIME::Encode::maybe_mime_encode_header is used to encode email headers. That portion of code only deal with header, I realize that it's probably incorrect as Email::MIME::Encode namespace is labelled as "private", using header_str on one Email::MIME object is probably more appropriate.

And Yes I agree this is not a bug, but a behavior change. Our main concern with this change is that this could affect the spam score from Mail::SpamAssassin, as an email without "" has a malus.

For your second message (hoped that github could handle reply to each message individually) I agree Email::MIME::Encode should always load Email::MIME::Header so it would provide a consistent behavior.

atoomic commented 7 years ago

Email::Simple::Header is probably the one providing all the abstraction to manipulate headers, but do not provide the encoding layer

pali commented 7 years ago

Very good question, in the existing code I maintain, Email::MIME::Encode::maybe_mime_encode_header is used to encode email headers.

Ah :-( Email::MIME::Encode is a private helper for MIME header encoding which is written in documentation https://metacpan.org/pod/Email::MIME::Encode but seems that it does not stop people to use it and then start seeing different behavior of private functions... Anyway, you are not alone some people are not stopped even when function starts with underline, which is perl way for private function https://github.com/rjbs/Email-MIME-ContentType/pull/6

So I would suggest you to stop using private functions of random modules as they can be changed at anytime. If you have really good reason for it, try to ask if such function cannot be marked as public, define stable API and important: write documentation for it.

as an email without "" has a malus.

What does it mean without ""? If whole problem of this reported issue is just because we generate something which is not good for Mail::SpamAssassin, then we could change generated output to be more compatible with Mail::SpamAssassin.

pali commented 7 years ago

Fix for Email::MIME::Encode to work also without explicit Email::MIME::Header loading is there: https://github.com/rjbs/Email-MIME/pull/54

atoomic commented 7 years ago

Email::MIME::Encode::maybe_mime_encode_header Thanks I've noted that this is for now a private function. Might consider asking to promote it to public with some doc changes & stable API.

as an email without "" has a malus. Sorry I mean an email without angle brackets <...>

Another idea would be patching Email::Address::XS (and also probably the non XS one) to add a boolean in the message_address struct [then use it] to know if angle brackets were used in the original version then preserve it if it were used ?

We could either make the default behavior to preserve the angle brackets or make it optional, so Email::MIME could set it or not depending on the option.

Problem is this would require patching the upstream dovecot code


diff --git a/Email-Address-XS/dovecot-parser.h b/Email-Address-XS/dovecot-parser.h
index b0263543d..d03a7ba1d 100644
--- a/Email-Address-XS/dovecot-parser.h
+++ b/Email-Address-XS/dovecot-parser.h
@@ -19,6 +19,7 @@ struct message_address {
        char *comment;
        /* there were errors when parsing this address */
        bool invalid_syntax;
+     bool use_anglebrackets;
 };
pali commented 7 years ago

Email::MIME::Encode::maybe_mime_encode_header

Thanks I've noted that this is for now a private function. Might consider asking to promote it to public with some doc changes & stable API.

I'm not very happy for making this module public for another modules. What is your purpose? Or better question, why is not Email::MIME enough for you? Is there a bug or missing feature in Email::MIME?

Another idea would be patching Email::Address::XS (and also probably the non XS one)

Email::Address is dead and contains security issues. Rather do not use it on untrusted input. I have example of emails which cause DOS on servers which uses Email::Address->parse(). Reason why I created Email::Address::XS.

to add a boolean in the message_address struct [then use it] to know if angle brackets were used in the original version then preserve it if it were used ?

I do not like this idea. Two reasons:

1) If you create Email::Address::XS object in any way (via ->new or via ->parse or via some parse_* function), then format method would depend only on tokens (user, host, address, phrase, comment) and when tokens are same, then it always provides same output. This is something important -- to provide unified, deterministic and immutable behavior.

2) Original string passed to ->parse method still would be different from output of ->format function. E.g. any comments in phrase or email address are dropped, phrase is simplified, etc... So it does not make sense to add some special case for angle brackets to have some different behavior.

See example:

$ perl -Iblib/lib -Iblib/arch -MEmail::Address::XS -E 'say Email::Address::XS->parse(q((comment) "Name" (comment) <a.(comment)b@host.com> (last comment)))->format()'
Name <a.b@host.com> (last comment)

There are a couple of ways how to write into header address with user='a.b', host='host.com', phrase='Name' and with comment after address '(last comment)'. Format function choose one which is not so complicated, valid according to RFC2822 and is not hard to parse.

In new version of Email::Address::XS there would be method original which like in Email::Address would return original string used for parsing.

Maybe now for you it looks like it could be used in Email::MIME. But, this would not work if you modify some member of Email::Address::XS -- e.g. any MIME encoding/decoding.

Look at example: <iwrestled@abear.once> - this does not need MIME re-encoding. But e.g. IDN <iwrestled@abéár.once> needs it. And personally I do not like idea that for IDN domains output string would be different as without IDN. (IDN is not supported yet, but pull request is prepared). In my opinion such behavior would cause more problems and specially for those who in most cases uses plain ASCII strings.

So my suggestion is:

If you need to preformat some email header, then use method header_raw_set. Here it is guaranteed that it would not be MIME-reencoded or modified. But then you must correctly MIME encode it. Method header_str_set is really mean to set Unicode string value and cause lot of problems for structured headers. This is also reason why support for header objects comes into Email::MIME.

atoomic commented 7 years ago

First thanks for your long & precise answer, and sorry for my late answer on this topic.

There are multiple reasons we are not using the full version of Email::MIME:

here are some basic metrics [this is mainly working on linux system - not stable on macosx]

> perl -MEmail::MIME -e 'print qx{grep RSS /proc/$$/status}'
VmRSS:      6980 kB

> perl -MEmail::MIME::Encode -e 'print qx{grep RSS /proc/$$/status}'
VmRSS:      5584 kB

> perl -MHomeMade::Email::Object -e 'print qx{grep RSS /proc/$$/status}'
VmRSS:      3240 kB

I trend to share your point of view where we should not use a private method in our codebase, and either use the whole enchilada or not... [ I've just opened an internal case for this ].

I also agree that both syntax are corrects. The root problem we were trying to solve here was to preserve the <quotes> to avoid a malus in the SpamAssassin score. This should probably be fixed upstream.

We were trying to add a rule saying: if the original email was using <quotes> then preserve them in the final version of the email.

From my point of view, we can close this ticket, I was mainly pointing to a behavior change. The commit in #54 would at least make it behave consistent.

pali commented 6 years ago

Anyway, have you reported bug to SpamAssassin that prediction of spam based on missing <quotes> is wrong? I think this is something which needs to be fixed...