markov2 / perl5-Mail-Message

Processing MIME messages
http://perl.overmeer.net/CPAN

Changing utf8 and us-ascii into utf-8 #12

Closed markov2 closed 1 year ago

markov2 commented 1 year ago

(Was issue #8 question 4)

The existing code used us-ascii, utf8 and utf-8 as default charsets in different places. I’ve changed them all to utf-8 (with a hyphen), which is Perl’s strict implementation of the official Unicode standard. Decoding as utf8 (no hyphen; Perl’s loose encoding) can produce non-character code points, which could expose bugs or security issues in downstream code. Also, using utf-8 consistently means that a call to encode() will not needlessly decode and re-encode the charset (e.g. when it is only called for content transfer decoding or encoding, as mentioned above). I realize that messages without a charset parameter should be US-ASCII, but in the real world they often contain non-ASCII characters, and utf-8 is increasingly the best guess for what the charset really is.
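For readers unfamiliar with the distinction, here is a minimal sketch of how the Encode module treats the two spellings. It is only an illustration, not code from Mail::Message; the helper `can_encode` is a hypothetical name introduced for this example.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding FB_CROAK LEAVE_SRC);

# Encode resolves the two spellings to different encodings.
my $strict_name = find_encoding('UTF-8')->name;   # the strict variant
my $lax_name    = find_encoding('utf8')->name;    # Perl's loose variant

# A code point above U+10FFFF is not valid Unicode.
my $beyond = do { no warnings 'utf8'; chr(0x110000) };

# Hypothetical helper: true when encoding $string as $charset succeeds.
sub can_encode {
    my ($charset, $string) = @_;
    no warnings 'utf8';
    # FB_CROAK makes Encode die on unmappable input; LEAVE_SRC keeps $string intact.
    my $bytes = eval { encode($charset, $string, FB_CROAK | LEAVE_SRC) };
    return defined $bytes;
}

for my $charset ('UTF-8', 'utf8') {
    printf "%-5s resolves to %-12s accepts U+110000: %s\n",
        $charset,
        find_encoding($charset)->name,
        can_encode($charset, $beyond) ? 'yes' : 'no';
}
```

The strict encoding rejects the out-of-range code point, while the loose one encodes it without complaint.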

markov2 commented 1 year ago

You are correct in some of the changes, but not all. Where we speak about the charset as an attribute of transmitted messages, we should strictly use utf-8. When we have a string in Perl on which we operate, we had better use utf8 as its representation. See patch [f9556ad]

The interpretation of what Perl does with charsets is misrepresented in your description. It is very important to understand the nasty hazards well before changing what I have implemented. Perl does not have a "raw bytes" type: it only has strings with or without the utf8 flag set. When you expect that you can treat a string as bytes, you will get punished: operations such as regular expression matches may accidentally upgrade your "bytes" into utf8. It is really easy to get into double encoding problems.
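A minimal sketch of the double encoding hazard described above, using only Encode and the built-in utf8:: functions. The variable names are illustrative and nothing here is taken from Mail::Message itself.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = "\x{F6}";                  # one character: o with diaeresis (U+00F6)
my $bytes = encode('UTF-8', $text);    # two octets: C3 B6

# Mistake: the octets are treated as characters and encoded a second time.
my $double = encode('UTF-8', $bytes);  # four octets: C3 83 C2 B6

# Decoding once no longer returns the original text, but two characters.
my $copy    = $double;
my $garbled = decode('UTF-8', $copy);  # U+00C3 U+00B6

# utf8::upgrade changes only the internal storage and the flag;
# the string still compares equal to the original octets.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Render each character of a string as a two-digit hex value.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

printf "encoded once:  %s\n", hex_dump($bytes);
printf "encoded twice: %s\n", hex_dump($double);
printf "flag set: %s, still equal: %s\n",
    utf8::is_utf8($upgraded) ? 'yes' : 'no',
    $upgraded eq $bytes      ? 'yes' : 'no';
```

The flag alone does not corrupt anything. The damage happens when a string that already holds encoded octets is encoded again.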

Treating bytes without a charset as utf-8 produces even worse results than treating them as us-ascii. The latter works better because Perl's non-utf8 strings are cp-1252 (Windows 1252), which has most characters in common with Latin1. Of the messages I see which "forget" their charset, quite a number contain ö or ß. Yeah, I know, European languages.

I would like to introduce autodetection of utf-8. For instance, use Encode::Guess when the charset is undefined. What do you think?
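As a rough sketch of what such autodetection could look like with Encode::Guess: the helper `decode_unlabelled` is hypothetical and not part of Mail::Message, and the fallback to cp1252 is an assumption made for this example, chosen because of the legacy Western European messages mentioned above.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;   # exports guess_encoding()

# Hypothetical helper: decode octets that arrived without a charset.
sub decode_unlabelled {
    my ($octets) = @_;

    # With the default suspects, Encode::Guess tries ascii, utf8
    # and BOM-marked UTF-16/32 only.
    my $guess = guess_encoding($octets);

    # guess_encoding() returns an encoding object on success
    # and a plain error string otherwise.
    my $copy = $octets;
    return ref $guess
        ? $guess->decode($copy)
        : decode('cp1252', $copy);   # assumed fallback for this sketch
}

for my $sample ("\xC3\xB6", "\xF6", "plain ascii") {
    my $decoded = decode_unlabelled($sample);
    printf "%-14s => %s\n",
        join(' ', map { sprintf '%02X', ord } split //, $sample),
        join(' ', map { sprintf 'U+%04X', ord } split //, $decoded);
}
```

One known limitation of this approach: legacy text whose octets happen to form valid UTF-8 sequences will be taken for UTF-8.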

jbalazerpfpt commented 1 year ago

I don't think you and I have any disagreement on utf8 vs. utf-8 charsets or the meaning of Perl character strings and the utf8 flag. Patch f9556ad looks fine.

Autodetection of utf-8 may be useful for some users, though as I described, not for us, because we have our own autodetection and decoding routines that handle many more encodings and problems. The world continues to get better with its migration to utf-8, so I would prioritize decoding that correctly over mislabeled legacy Latin encodings.

markov2 commented 1 year ago

I have literally spent weeks thinking about the best backwards-compatible solutions. Actually, I hope I have found a way of solving this without breaking backward compatibility.

Probably, you can simplify your code via the new Mail::Message::Body::Decode->charsetDetectAlgorithm()

Can you please test the changes in the repo with your regression test-set? There will be changes. I may also produce a test release for you, in case that's simpler.

jbalazerpfpt commented 1 year ago

Thanks, Mark. We periodically incorporate new versions of Mail-Message, so when that happens I will take steps to either patch the code for our needs or rework our code to utilize new API features. We don't have an upgrade scheduled, but when it happens all of the code will be tested.

We have extensive charset decoding heuristics and detection which operate on the text after decoding of the transfer encoding. So I don't imagine we will have a use for charsetDetectAlgorithm(). We always do our own charset decoding.

markov2 commented 1 year ago

I had hoped that you could test the changes before I release them to the public, because they may break other people's instances as well. But you are unable to do it?

Of course, you can proceed with your existing code, but my prediction is that the use of charsetDetectAlgorithm() would really simplify your code.

markov2 commented 1 year ago

Supported since 3.013