Mail-Message header unfolding replaces tabs with spaces

jbalazerpfpt commented 1 year ago

Per RFC 2822 section 3.2.3, a header field can be folded by inserting CRLF before a white space character - a space or a tab. And to unfold a folded field, the CRLF should be removed. Mail-Message 3.x introduces a bug in the unfolding of folded header fields: if the folding white space is CRLF followed by a tab character, the unfolding would convert the tab to a space. This changes the message, with potential consequences. Header unfolding should remove the CRLF without changing the white space character that follows it.

I have fixes here for three files with the bug. The accompanying stripping of leading or trailing white space has also been rewritten in an optimized fashion that works without backtracking.

--- lib/Mail/Message/Field.pm      2022-12-18 18:46:22.733514100 -0800
+++ lib/Mail/Message/Field-mod.pm   2022-12-18 18:44:59.089131000 -0800
@@ -441,7 +441,7 @@
     my $wrap  = shift || $default_wrap_length;
     defined $line or $line = '';

-    $line    =~ s/\n\s/ /gms;            # Remove accidental folding
+    $line    =~ s/\n(\s)/$1/gms;            # Remove accidental folding
     return " \n" unless CORE::length($line);  # empty field

     my @folded;
@@ -473,9 +473,11 @@
 sub unfold($)
 {   my $string = $_[1];
     for($string)
-    {   s/\r?\n\s?/ /gs;  # remove FWS
-        s/^ +//;
-        s/ +$//;
+    {   s/\r?\n(\s)/$1/g;  # remove FWS
+        s/\r?\n/ /g;
+
+        # remove leading and trailing white space (optimized)
+        $string = (m/(\S(?:.*\S)?)/)[0] // '';
     }
     $string;
 }

--- lib/Mail/Message/Field/Full.pm       2022-12-17 17:27:26.265716100 -0800
+++ lib/Mail/Message/Field/Full-mod.pm      2022-12-18 18:36:44.796063000 -0800
@@ -122,11 +122,13 @@
     }

     $body = $self->foldedBody;
-    $body =~ s/^ //;

-       # remove FWS, also required within quoted strings.
-    $body =~ s/\r?\n\s?/ /g;
-    $body =~ s/ +$//;
+    # remove FWS, also required within quoted strings.
+    $body =~ s/\r?\n(\s)/$1/g;
+    $body =~ s/\r?\n/ /g;
+
+    # remove leading and trailing white space (optimized)
+    $body = ($body =~ m/(\S(?:.*\S)?)/)[0] // '';
     $body;
 }

--- lib/Mail/Message/Field/Structured.pm 2022-12-18 21:27:01.866583200 -0800
+++ lib/Mail/Message/Field/Structured-mod.pm     2022-12-18 21:28:52.263338000 -0800
@@ -71,8 +71,11 @@
 {   my ($self, $string) = @_;

     # remove FWS, even within quoted strings
-    $string =~ s/\r?\n\s?/ /gs;
-       $string =~ s/ +$//;
+    $string =~ s/\r?\n(\s)/$1/g;
+    $string =~ s/\r?\n/ /g;
+
+    # remove trailing white space (optimized)
+    $string = ($string =~ m/((?:.*\S)?)/)[0] // '';

     my $datum = '';
     while(length $string && substr($string, 0, 1) ne ';')

The original regex substitutions would also replace a CRLF that is NOT followed by white space. It is unclear to me why they would do that, since a CRLF is only allowed in the header field when it is part of folding white space. Nevertheless, I have preserved that behavior with explicit s/\r?\n/ /g substitutions. Perhaps they are unnecessary.

markov2 commented 1 year ago

RFC5322 has replaced 2822. https://www.rfc-editor.org/rfc/rfc5322#section-3.2.2 When you look at the headers, they often use tabs where a blank is sufficient, even in Subject folding. So reality differs from the intention of CFWS. That's why I changed the output a bit. But, I took that part of your changes.

Some applications sometimes produce lines which are too long. (Old) equipment breaks those lines into separate lines, which leads to \r\n without following white-space.

The last paragraph says:

Runs of FWS, comment, or CFWS that occur between lexical tokens in a structured header field are semantically interpreted as a single space character.

So, I do hope a change from 'space' into retaining the white-space in the last change does not break people's code.

You say "This changes the message, with potential consequences.". Still, after your changes, the message is still changed on leading and trailing white-spaces. So: why is that change acceptable?

This horrible

+    # remove trailing white space (optimized)
+   $body = ($body =~ m/(\S(?:.*\S)?)/)[0] // '';

is claimed to be faster than:

-    for($body) { s/^\s+//; s/\s$// }

I would like to see proof of that, before I add such horrible construct. Assuming that the line change is rarely applied.

jbalazerpfpt commented 1 year ago

The slowdown seen in s/\s$// was only seen in multi-threaded applications, so it may be a bug in Perl itself. I'm hearing that the issue may already be fixed in the latest version of Perl.

markov2 / perl5-Mail-Message

Mail-Message header unfolding replaces tabs with spaces #7