Transfer-encoding with unspecified charset

markov2 commented 1 year ago

(Was issue #8 question 3)

In Mail-Message 2.x, you could call encode() with a transfer_encoding option but no charset option as a way of decoding or encoding the content transfer encoding, while not decoding or encoding the character set encoding. This functionality didn’t work in Mail-Message 3.x. Our code frequently makes use of this when we want to do our own character set decoding or encoding, outside what Mail-Message provides.

markov2 commented 1 year ago

Question: what makes your charset encoding/decoding so special that Perl cannot handle it? I have no idea what more you could do.

jbalazerpfpt commented 1 year ago

We have routines to detect the character set encoding of a message part body, and do the decoding. It has a number of heuristics and rules to make sense of commonly mislabeled cases, and fallback options for when decoding fails. These problems are surprisingly common in email. To give you a few examples:

UTF-8 bodies having no charset parameter (this is the most common problem)
Extended charsets like big5plus or GB18030 being labeled as the base charset
All manner of single-byte extended ASCII charsets having no charset parameter
UTF-16 message bodies with no byte order mark and no indication of the byte order in the charset parameter value

So we don't rely on Mail-Message to do the charset decoding. We just ask it to decode the content-transfer-encoding and then we decode the charset ourselves. It's all Perl code, but with lots of additional logic.
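As a rough illustration of the two-step approach described above (not the actual routines, which are not shown in this thread), here is a minimal sketch that uses only core Perl modules. The function names `decode_transfer` and `decode_charset` are hypothetical, and the fallback order (declared charset, then strict UTF-8, then ISO-8859-1) is an assumption chosen for brevity. It does not address mislabeled extended charsets or UTF-16 without a byte order mark.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);
use MIME::Base64 qw(decode_base64);
use MIME::QuotedPrint qw(decode_qp);

# Step 1: undo only the Content-Transfer-Encoding, leaving raw bytes.
sub decode_transfer {
    my ($raw, $cte) = @_;
    $cte = lc($cte // '7bit');
    return decode_base64($raw) if $cte eq 'base64';
    return decode_qp($raw)     if $cte eq 'quoted-printable';
    return $raw;    # 7bit, 8bit, binary: bytes pass through unchanged
}

# Step 2 (hypothetical heuristic): try the declared charset strictly,
# then strict UTF-8, and finally fall back to ISO-8859-1, which maps
# every byte to a character and therefore never fails.
sub decode_charset {
    my ($bytes, $declared) = @_;
    for my $cs (grep { defined $_ } ($declared, 'UTF-8')) {
        # FB_CROAK makes decode() die on malformed input (or the call dies
        # on an unknown charset name); eval turns that into undef.
        my $text = eval { decode($cs, $bytes, FB_CROAK | LEAVE_SRC) };
        return ($text, $cs) if defined $text;
    }
    return (decode('iso-8859-1', $bytes), 'iso-8859-1');
}
```

With this sketch, a body that has no charset parameter but contains valid UTF-8 is decoded as UTF-8, while bytes that are not valid UTF-8 end up in the single-byte fallback.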

Then for writing the output, it's a similar story. We encode the charset ourselves, and ask Mail-Message to do the content-transfer-encoding when we create a Mail::Message::Body object, but not the charset encoding. Again, it's Perl code, but with additional logic for necessary clean-up, e.g., removing characters that can't be encoded or selecting a different encoding, selecting the content-transfer-encoding, or adding byte order marks where necessary. For encoding, it's not so much that Mail-Message can't do the charset encoding for us, but rather that we built our code around a Mail-Message 2.x that gave us the option of doing our own charset encoding.
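The output side can be sketched in the same spirit, again with core modules only and without calling Mail-Message. The names `encode_charset` and `encode_transfer` are hypothetical, and both the candidate-charset loop and the threshold for choosing quoted-printable over base64 are assumptions made for illustration. Line-length limits and byte order marks are ignored here.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK LEAVE_SRC);
use MIME::Base64 qw(encode_base64);
use MIME::QuotedPrint qw(encode_qp);

# Pick the first candidate charset that can represent the whole text;
# fall back to UTF-8, which can encode any Unicode string.
sub encode_charset {
    my ($text, @candidates) = @_;
    for my $cs (@candidates) {
        my $bytes = eval { encode($cs, $text, FB_CROAK | LEAVE_SRC) };
        return ($bytes, $cs) if defined $bytes;
    }
    return (encode('UTF-8', $text), 'UTF-8');
}

# Choose a Content-Transfer-Encoding from the encoded bytes (assumed rule):
# pure ASCII stays 7bit, mostly-ASCII data uses quoted-printable,
# everything else uses base64.
sub encode_transfer {
    my ($bytes) = @_;
    my $high = () = $bytes =~ /[^\x00-\x7F]/g;    # count non-ASCII bytes
    return ('7bit', $bytes) if $high == 0;
    return ('quoted-printable', encode_qp($bytes))
        if $high * 6 < length $bytes;
    return ('base64', encode_base64($bytes));
}
```

The resulting bytes and the chosen charset name would then be handed to whatever builds the message part, with charset encoding on that side left switched off.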

markov2 / perl5-Mail-Message

Transfer-encoding with unspecified charset #11