Mail::Message::Field::Full::consumeComment() doesn't handle unterminated comments correctly

jbalazerpfpt commented 1 year ago

Hello, Mark. This is a follow-up to yyang's fix in https://rt.cpan.org/Public/Bug/Display.html?id=140879 . That fix didn't work correctly. I believe this fix works correctly:

--- lib/Mail/Message/Field/Full-3.012.pm       2022-12-17 17:27:26.265716100 -0800
+++ lib/Mail/Message/Field/Full-3.012-mod-consumeComment.pm    2022-12-17 17:35:08.855261000 -0800
@@ -375,20 +375,24 @@

 sub consumeComment($)
 {   my ($thing, $string) = @_;
+    # To avoid an infinite loop in the caller, if $string starts with '(' make
+    # sure we consume and return a comment, even if the comment is unterminated

     return (undef, $string)
-        unless $string =~ s/^\s*\(((?:[^)\\]+|\\.)*)\)//;
+        unless $string =~ s/^\s*\(((?:[^)\\]+|\\(?:.|$))*)\)?//;

     my $comment = $1;
+
+    # Add comment parts until the nested comment parentheses are balanced
+    # or there's no string left to consume
     while(1)
     {   (my $count = $comment) =~ s/\\./xx/g;

-        last if $count =~ tr/(//  ==  $count =~ tr/)//;
-
-        return (undef, $_[1])
-            unless $string =~ s/^((?:[^)\\]+|\\.)*)\)//;
+        last if ($count =~ tr/(//  ==  $count =~ tr/)//) || !length($string);

-        $comment .= ')'.$1;
+        if ($string =~ s/^((?:[^)\\]+|\\(:?.|$))*)\)?//) {
+            $comment .= ')'.$1;
+        }
     }

     $comment =~ s/\\([()])/$1/g;

markov2 commented 1 year ago

Do you have some examples, which I can use in the regression scripts?

-        unless $string =~ s/^\s*\(((?:[^)\\]+|\\.)*)\)//;
+        unless $string =~ s/^\s*\(((?:[^)\\]+|\\(?:.|$))*)\)?//;

It looks like your expression is seriously broken. \\(.|$) is not the same as \\. | \\$, because $ is interpreted here as "end of line". When you mean a $ as character, then it is already included in \\.. When you intended to express "end-of-line", it should be

+        unless $string =~ s/^\s* \( ((?:[^)\\]+|\\.)* )  (?:\)|$) //x;

But this would probably hang on (abc\. Also, I think the expression does not handle '()' well. Next attempt:

+        unless $string =~ s/^\s* \( ( (?: \\. | [^)] )* )  (?: \) | $ ) //x;

Agree?

markov2 commented 1 year ago

Actually, what you wish is a different default: instead of getting an error when the comment is not closed, it should consume the rest of the string. I have not seen an endless loop without that change. My new version for this method is:

sub consumeComment($)
{   my ($thing, $string) = @_;

    return (undef, $string)
        unless $string =~ s/^\s* \( ((?:\\.|[^)])*) (?:\)|$) //x;
        # allow unterminated comments

    my $comment = $1;

    # Continue consuming characters until we have balanced parens, for
    # nested comments which are permitted.
    while(1)
    {   (my $count = $comment) =~ s/\\./xx/g;
        last if +( $count =~ tr/(// ) == ( $count =~ tr/)// );

        last if $string !~ s/^((?:\\.|[^)])*) \)//x;  # cannot satisfy

        $comment .= ')'.$1;
    }

    for($comment)
    {   s/^\s+//;
        s/\s+$//;

        # Although nested comments are supported without backslash, not
        # everyone may know that.  Backslashes are officially even not
        # permitted in comments.  Remove backslashes before nested comment.
        s/\\ ( [()] )/$1/gx;
    }

    ($comment, $string);
}

Better now?

jbalazerpfpt commented 1 year ago

In Mail-Message 3.011, the infinite loop occurred in these lines of Mail::Message::Field::Structured::parse():

    while(length $string && substr($string, 0, 1) ne ';')
    {   (undef, $string)  = $self->consumeComment($string);
        $datum .= $1 if $string =~ s/^([^;(]+)//;
    }

The loop removes text from the beginning of $string until it hits a left parenthesis, and then relies on consumeComment to remove the comment. But if the comment can't be removed because it is unterminated, no more text is removed from the beginning of $string, causing an infinite loop. My test 1 below triggers an infinite loop. consumeComment should consume every comment that starts with a left parenthesis, even if it is unterminated. It should consume text to the end of the string, if necessary.

Here are some headers I used to test my fix. For tests 1-9, a comment should be consumed from (two to the end of the line.

Content-Type: 1: one \(two\) (three)
Content-Type: 2: one (two (three) four)
Content-Type: 3: one (two (three) four
Content-Type: 4: one (two \( (three) four)
Content-Type: 5: one (two (three) \) four)
Content-Type: 6: one (two (three) four \) five
Content-Type: 7: one (two \
Content-Type: 8: one (two (three) four \
Content-Type: 9: one (two (three (four) five
Content-Type: 10: one ()

Your latest proposed code avoids an infinite loop, but isn't correct for my tests 3, 6, 8, and 9. The problem is inside the while loop. It should consume to the end of the string when a right parenthesis is not found, and exit the loop when there is no more string to consume, like this:

    while(1)
    {   (my $count = $comment) =~ s/\\./xx/g;
        last if +( $count =~ tr/(// ) == ( $count =~ tr/)// ) || !length($string);

        last if $string !~ s/^((?:\\.|[^)])*) (?:\)|$) //x;

        $comment .= ')'.$1;
    }

The $ in that regex is important. Without it, if a right parenthesis isn't found and the string contains \), the regex can backtrack until [^)] matches the backslash, at which point the right parenthesis is matched as if it terminated the comment instead of being recognized as part of a quoted-pair. Test 6 shows that. With the $, the regex always matches, even for the empty string, so the last if isn't strictly necessary in that line anymore. But I am inclined to keep the last if so as to avoid an infinite loop in case of bugs. Some safety is called for inside any while(1) loop. The exit condition for an empty string is important, too, or else the loop will keep adding false right parentheses until they are balanced. Test 9 shows that. Otherwise your proposed code looks good to me.

But I still believe my originally proposed patch works correctly, and it's now been well tested in our QA and production environments. From my original patch, \$?:.|$) matches a quoted-pair or a lone backslash at the end of the string. $? matches the right parenthesis that terminates the comment, or it matches the empty string, allowing the string to be consumed to the end. The function consumes () just fine. And by using a character class that won't match a backslash, [^)\\], I eliminate the possibility of backtracking. While a different regex that avoids backtracking in other ways may be just fine, being explicit reduces the chances for there to be an unseen bug.

markov2 commented 1 year ago

I made some changes to the parsing of comments wrt illegal characters. Phrases which are formatted RFC2822 obs-phrase will be consumed better.

Maybe, you can extend the test-script with the real output you are expecting (I do want to survive broken fields, but not repair them to all cost)

markov2 / perl5-Mail-Message

Mail::Message::Field::Full::consumeComment() doesn't handle unterminated comments correctly #6