Interpretation of unspecified charset

markov2 commented 1 year ago

(Was Issue #8 question 1)

Mail::Message::Body 2.x supported body objects created with no charset parameter, but that support was removed in version 3.x. My patch restores that support. We rely on this in our environment: we take an email message received from the Internet, parse it, and possibly decode, modify, and re-encode a part. We then write out the whole message as a new message to be sent on to the recipient, and sometimes a message has a part with no charset parameter.

Of course, RFCs say a body with no charset parameter is supposed to be US-ASCII, but sometimes senders defy the RFCs and send messages using non-ASCII characters and no charset parameter. For these cases, wherever decoding of the character set encoding is not required, we prefer to preserve the original bytes and not make any attempt to decode or re-encode the character set encoding. Or, when decoding is required, we make our best guess about how to decode the charset encoding, but we want to write out the new message with no charset parameter when the original message didn't have one, so as to not commit our guess into the message and possibly corrupt its contents for the recipient if we happen to guess wrong.

My change allows that: when creating a body object with no charset, the data supplied must be bytes, which means bytes of some unknown character set encoding. So long as no operation is performed that requires decoding of the character set encoding, the bytes will be preserved. If decoding is required, utf-8 will be assumed. Support for the “PERL” charset is maintained, but it must be explicitly stated and is no longer the default.
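As an illustration of the behaviour described above, here is a minimal Python sketch. It is not the Mail::Message::Body API or the patch itself: the `RawBody` class and its method names are hypothetical, and replacing undecodable bytes when falling back to UTF-8 is an assumption made only for this example.

```python
from typing import Optional


class RawBody:
    """Hypothetical body object; illustrates the idea only."""

    def __init__(self, data: bytes, charset: Optional[str] = None):
        if not isinstance(data, bytes):
            raise TypeError("body data must be bytes")
        self.data = data
        self.charset = charset  # None means "unknown, not declared"

    def as_bytes(self) -> bytes:
        # No charset decoding needed: the original bytes pass through untouched.
        return self.data

    def as_text(self) -> str:
        # Decoding is required: fall back to UTF-8 when no charset was given.
        # errors="replace" is an assumption of this sketch.
        return self.data.decode(self.charset or "utf-8", errors="replace")
```

As long as only `as_bytes()` is called, a body with an unknown charset round-trips byte for byte; the UTF-8 guess is applied only when text is actually requested, and it is never stored back on the object.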

markov2 commented 1 year ago

> Mail::Message::Body 2.x supported body objects created with no charset parameter, but that support was removed in version 3.x. My patch restores that support.

Actually, before version 3, the charset was left in limbo, which was increasingly a problem as more and more UTF-8 was used. So, the changes are there to solve a problem. Now that most people know how to produce € and ö on their keyboards, it is unacceptable in the generic case to use "unknown charset" logic.

MailBox is a library, so agnostic about the actual application. It tries to protect users from mistakes. Keeping the charset undefined is not an option for the library.

When you want to send messages whose charset has not been corrected, you could explicitly remove the charset= parameter that is produced in the message part. RFC rule: be gentle in what you accept, but strict in what you produce.

I think we should autodetect the charset when it is missing. Defaulting to us-ascii is not sufficient in reality. This should happen during decode() in the ::Body. A simple check for any byte of 0x80 or above (that is, any non-ASCII byte) would already be a great improvement.
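A rough Python sketch of what such a check could look like. The function name `guess_charset` is hypothetical, and the extra step of testing whether the data is valid UTF-8 goes beyond the simple high-bit check suggested above; it is an assumption of this example, not something the library is known to do.

```python
from typing import Optional


def guess_charset(data: bytes) -> Optional[str]:
    """Guess a charset for undeclared body bytes (illustrative only)."""
    # No byte has the high bit set, so the data is plain US-ASCII.
    if all(b < 0x80 for b in data):
        return "us-ascii"
    # Non-ASCII bytes are present: accept UTF-8 only if the data is valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # unknown legacy encoding; no guess is offered
    return "utf-8"
```

Returning `None` rather than a fallback keeps the "unknown" case visible to the caller, which can then decide whether to guess further or to leave the bytes alone.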

Mostly, I see missing charset in attachments which are read from file: HTML or plain text footers. When it is HTML, we could also attempt to scan it for <meta charset>.
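For the HTML case, a scan might be sketched in Python as follows. The name `sniff_html_charset`, the regular expression, and the 1024-byte scan limit are all assumptions of this example; a regex like this only catches the common `<meta charset=...>` and `<meta http-equiv=... content="...; charset=...">` forms and is no substitute for a real HTML parser.

```python
import re
from typing import Optional

# Matches charset=... inside a <meta ...> tag, with or without quotes.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_html_charset(data: bytes, limit: int = 1024) -> Optional[str]:
    """Return the charset named in an HTML <meta> tag, if one is found."""
    # Only the start of the document is scanned; the limit is an assumption.
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

The result is still only a hint: the declared name may be wrong or unsupported, so a caller would want to verify that the bytes actually decode with it.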

jbalazerpfpt commented 1 year ago

I take your point about being gentle with what you accept and strict with what you produce. But it's not a mail relay agent's job to fix all RFC compliance issues when relaying a message. That's what we're doing: accepting a message, making small modifications, and relaying the modified message. So for certain types of compliance problems in the input, we do need to be able to produce output that is not compliant, because we don't want to make things any worse. We don't want to guess the charset wrong and commit that guess to the output, because that can corrupt the message for the recipient. We prefer to do as little interpretation of the input as possible, preserving the input data as much as possible. It absolves us of blame for problems, putting the onus on the sender to be compliant.

We have our own charset detection and decoding routines, so when we read message body data with Mail::Message, all we ask it to do is to decode the content transfer encoding. Really then, it's just two special things that our use case requires:

- The ability to decode a message body without decoding the charset.
- The ability to write a new message body without encoding the charset or declaring it in the Content-type header field.
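To make these two requirements concrete, here is how they look with Python's standard `email` package, used purely as an analogy; it says nothing about how Mail::Message does or should behave. The helper names and the sample message are hypothetical, and the rewrite helper assumes the part uses base64 as its content transfer encoding.

```python
import base64
from email import message_from_bytes, policy

# Sample part: base64-encoded Latin-1 bytes with no charset parameter.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"Y2Fm6Q==\r\n"
)


def decode_transfer_only(raw: bytes) -> bytes:
    """Undo the content transfer encoding; leave the charset alone."""
    msg = message_from_bytes(raw, policy=policy.compat32)
    # decode=True reverses base64/quoted-printable and returns raw bytes.
    return msg.get_payload(decode=True)


def rewrite_without_charset(raw: bytes, new_body: bytes) -> bytes:
    """Replace the body bytes without adding a charset parameter."""
    msg = message_from_bytes(raw, policy=policy.compat32)
    # Assumes the part is base64-encoded; the existing headers are kept as-is.
    msg.set_payload(base64.b64encode(new_body).decode("ascii"))
    return msg.as_bytes()
```

Here `decode_transfer_only(RAW)` yields the original bytes with no charset interpretation, and the output of `rewrite_without_charset` carries the same `Content-Type: text/plain` header as the input, so no guess about the charset is committed to the relayed message.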

I understand that you have your priorities for Mail-Message, and they may not be compatible with ours. If that's the case, we'll continue to patch or work around as necessary. I offer the suggestion and patch in the spirit of open source collaboration, in case you or the users of Mail-Message could benefit, but I totally understand if you opt not to take the suggestion. We remain grateful for the many things that Mail-Message does for us.

markov2 / perl5-Mail-Message

Interpretation of unspecified charset #9