Hello, Mark. I'm not sure if this qualifies as a feature request or a bug fix, but I'll try to explain our use case and why it was necessary for us to make these changes. I’m including a pair of patches for your consideration. They fix several issues:
Mail::Message::Body 2.x supported body objects created with no charset parameter, but that support was removed in version 3.x. My patch restores that support. We make use of this in our environment, because we will take an email message received from the Internet, parse it, possibly decode, modify, and re-encode a part, and write out the whole message as a new message to be sent on to the recipient - and sometimes a message has a part with no charset parameter. Of course RFCs say a body with no charset parameter is supposed to be US-ASCII, but sometimes senders defy the RFCs and send messages using non-ASCII characters and no charset parameter. For these cases, wherever decoding of the character set encoding is not required, we prefer to preserve the original bytes and not make any attempt to decode or re-encode the character set encoding. Or when decoding is required, we make our best guess about how to decode the charset encoding, but we want to write out the new message with no charset parameter when the original message didn't have one, so as to not commit our guess into the message and possibly corrupt its contents for the recipient if we happen to guess wrong. My change allows that: when creating a body object with no charset, the data supplied must be bytes, which means bytes of some unknown character set encoding. So long as no operation is performed that requires decoding of the character set encoding, the bytes will be preserved. If decoding is required, utf-8 will be assumed. Support for the “PERL” charset is maintained, but it must be explicitly stated and is no longer the default.
In some cases with Mail-Message 3.011 a Mail::Message::Body object could be created with no charset, and then when output by encoded() it would actually be written with charset="PERL" in the Content-Type header, which is never what you want. My changes fix that.
In Mail-Message 2.x you could call encode() with a transfer_encoding option but no charset option as a way of decoding or encoding the content transfer encoding while not decoding or encoding the character set encoding. This functionality didn’t work in Mail-Message 3.x but is restored by my change. Our code frequently makes use of this when we want to do our own character set decoding or encoding, outside of what Mail-Message provides.
The existing code used us-ascii, utf8 and utf-8 as default charsets in different places. I’ve changed them all to utf-8 (with a hyphen), which is Perl’s strict implementation of the official UTF-8 standard. Decoding as utf8 (no hyphen; Perl’s loose encoding) can produce non-character code points, which could expose bugs or security issues in downstream code. Also, using utf-8 consistently means that a call to encode() will not decode and encode the character set encoding unnecessarily (e.g. for content transfer decoding or encoding as mentioned above). I realize that messages without a charset parameter should be US-ASCII, but in the real world they often contain non-ASCII characters, and utf-8 is increasingly the best guess for what the charset really is.
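To illustrate the difference between the two spellings, here is a minimal sketch using only the core Encode module (not Mail-Message itself). The input bytes are a made-up example: they follow the UTF-8 bit pattern but encode U+110000, one past the highest Unicode code point. As far as I understand Encode's behaviour, the loose utf8 decoder accepts them while the strict UTF-8 decoder rejects them.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical input: a well-formed 4-byte sequence for U+110000,
# which lies just outside the Unicode range (maximum is U+10FFFF).
my $bytes = "\xF4\x90\x80\x80";

# decode() with a CHECK argument may modify its input, so work on copies.
my $lax_input    = $bytes;
my $strict_input = $bytes;

# Loose "utf8" (no hyphen): Perl's extended encoding accepts the sequence.
my $lax = decode('utf8', $lax_input, FB_CROAK);

# Strict "UTF-8" (with hyphen): FB_CROAK makes decode() die on invalid
# input, so eval leaves $strict undefined if the bytes are rejected.
my $strict = eval { decode('UTF-8', $strict_input, FB_CROAK) };

printf "utf8  (loose):  decoded to U+%X\n", ord $lax;
print  "UTF-8 (strict): ", (defined $strict ? "accepted" : "rejected"), "\n";
```

Downstream code that assumes every character is a valid Unicode code point would only be safe with the strict variant.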
--- lib/Mail/Message/Body.pm 2022-12-17 14:06:20.000000000 -0800
+++ lib/Mail/Message/Body-mod.pm 2022-12-17 14:54:31.000000000 -0800
@@ -136,7 +136,9 @@
$mime ||= 'text/plain';
$mime = $self->type($mime);
- $mime->attribute(charset => ($charset || 'PERL'))
+ # Allow undefined charset: it will default to utf-8 if decoding
+ # is necessary.
+ $mime->attribute(charset => ($charset || undef))
if $mime =~ m!^text/!i && !$mime->attribute('charset');
$self->transferEncoding($transfer) if defined $transfer;
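The use case behind this patch can be sketched independently of Mail-Message with the core MIME::Base64 and Encode modules. This is only an illustration of the two layers (content transfer encoding versus character set encoding), not the library's API; the sample part content is an assumption made up for the example. It shows a part whose bytes are really Latin-1 but which carries no charset parameter: undoing only the transfer encoding keeps the bytes intact, while guessing utf-8 and re-encoding would change them.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);
use Encode qw(decode encode);

# Hypothetical part body: "café" in Latin-1, sent without a charset parameter.
my $original = "caf\xE9\n";
my $wire     = encode_base64($original);   # as it appears in the message

# Layer 1 only: undo the content transfer encoding. The result is still
# raw bytes of an unknown character set.
my $bytes = decode_base64($wire);

# Byte-level modification that needs no knowledge of the charset.
(my $modified = $bytes) =~ s/\n\z/ au lait\n/;

# Re-apply the transfer encoding; the original non-ASCII byte survives.
my $preserved = decode_base64(encode_base64($modified));

# By contrast, committing to a utf-8 guess: the lone \xE9 byte is not valid
# UTF-8, so decode() substitutes U+FFFD and the original byte is lost.
my $guessed = encode('UTF-8', decode('UTF-8', $bytes));

print "bytes preserved: ", ($preserved eq "caf\xE9 au lait\n" ? "yes" : "no"), "\n";
print "guess changed the content: ", ($guessed ne $original ? "yes" : "no"), "\n";
```

This is why we want a body with no charset to stay that way on output unless an operation really requires character set decoding.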
Wow, huge issue description with many complications. Let me split them up into separate issues, so they can be discussed. #9, #10, #11, and #12. Please participate in those four.