perlorg / cnntp

www.nntp.perl.org site

Q-encoded subject headers are not handled #4

Closed johannessen closed 2 years ago

johannessen commented 2 years ago

Message subjects that are Q-encoded according to RFC 2047 because they contain non-ASCII characters are not decoded correctly. See the following messages as an example:

  1. https://www.nntp.perl.org/group/perl.perl5.porters/;msgid=C4FDD8A1-D380-41CE-8328-9B8322924F37@felipegasper.com
  2. https://www.nntp.perl.org/group/perl.perl5.porters/;msgid=09046984-2ff2-c5a7-105b-a20afa061809@khwilliamson.com

For the first message, the subject header is shown in undecoded form. However, for the second message, the header is missing entirely. This is a particular problem because the web app’s thread view uses subject headers as text for hyperlinks, which become unusable as a result.

In the source of the above messages as I received them over SMTP, the headers look like this:

Subject: =?utf-8?Q?Karl=E2=80=99s_auto-detect_WAS_Re=3A_tightening_up_sour?=
 =?utf-8?Q?ce_code_encoding_semantics?=
Message-Id: <C4FDD8A1-D380-41CE-8328-9B8322924F37@felipegasper.com>
X-Mailer: Apple Mail (2.3654.120.0.1.13)

Message-ID: <09046984-2ff2-c5a7-105b-a20afa061809@khwilliamson.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.5.0
Subject:
 =?UTF-8?Q?Re=3a_Karl=e2=80=99s_auto-detect_WAS_Re=3a_tightening_up_?=
 =?UTF-8?Q?source_code_encoding_semantics?=

In the second case, I initially suspected Thunderbird’s Q-encoding, which uses lowercase hex chars. This is not recommended according to the RFC. However, Encode::MIME::Header (used by Email::MIME, which is in turn used by CN::Model::Article) can actually handle this just fine:

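use v5.10;   # enables say
use Encode;
my $s = "=?UTF-8?Q?Re=3a_Karl=e2=80=99s_auto-detect_WAS_Re=3a_tightening_up_?=";
# decode the MIME header to characters, then encode as UTF-8 bytes for output
say encode("UTF-8", decode("MIME-Header", $s));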
# Re: Karl’s auto-detect WAS Re: tightening up
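
The same check can be run against the first message's subject, which is folded across two lines and split into two encoded-words in the middle of the word "source". This is only a minimal sketch: the variable names are made up for illustration, and it exercises Encode's MIME-Header decoder directly rather than the code path in CN::Model::Article.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Subject field body of the first message as quoted above, including the fold.
# (Variable names here are illustrative, not taken from cnntp.)
my $raw = "=?utf-8?Q?Karl=E2=80=99s_auto-detect_WAS_Re=3A_tightening_up_sour?=\n"
        . " =?utf-8?Q?ce_code_encoding_semantics?=";

# decode() returns a string of Perl characters, not UTF-8 bytes.
my $subject = decode('MIME-Header', $raw);

# Encode on output so the non-ASCII apostrophe prints without a warning.
binmode STDOUT, ':encoding(UTF-8)';
print "$subject\n";
```

If the two halves come out joined as "source", the decoder is dropping the white space between adjacent encoded-words as RFC 2047 requires, which would again point away from Encode as the culprit.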

Also, the first message looks like decoding wasn’t even attempted, so I’m not sure it would make sense to assume that a decoding failure is what causes the header to be empty for the second message.

Don’t know what’s going on here.

rspier commented 2 years ago

Fixed. Thanks for the report. And thanks to @rjbs for the nudge.

https://www.nntp.perl.org/group/perl.copenhagen/2015/11.html now has significantly more ø's.

-R

johannessen commented 2 years ago

Thanks for looking into this! It seems that a fix for the more serious 2nd example has not yet been deployed:

https://www.nntp.perl.org/group/perl.perl5.porters/2022/02/msg263138.html still has no subject visible, making links to that article (and several others) unclickable in the thread view at the end.

rspier commented 2 years ago

That's a different problem:

> select h_subject from articles where id=263138 and group_id=60;
+-----------+
| h_subject |
+-----------+
|           |
+-----------+
1 row in set (0.086 sec)

I'm guessing the issue is with the header unfolding code here: https://github.com/abh/colobus/blob/master/colobus-archive#L176

Last touched 12 years ago!

I can try and poke at it, time permitting, although I'm tempted to just do:

$subject = "Unknown Subject" unless $subject;

rspier commented 2 years ago

PR sent. I'll ponder data backfill. There are ~541 cases.

johannessen commented 2 years ago

"Unknown subject" would certainly work around the UI navigation issue just fine.

Looking at the Colobus source, I’m pretty sure you’re exactly right about the header unfolding being the cause of this other problem. Specifically, the regex in line 197 fails to recognise header field bodies that begin with folding white space, which is in fact allowed at least since RFC 2822.

I’d expect the following patch to fix it:

-   if ($line =~ m/^$header: (.+?)\r?\n/is) {
+   if ($line =~ m/^$header: ?(.*?)\r?\n/is) {
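
The difference between the two patterns can be reproduced in isolation. The sketch below is not the Colobus code: the input string is reconstructed from the second message's headers quoted earlier, and the unfolding step at the end is an assumption about one way the continuation lines could be gathered, not what colobus-archive actually does.

```perl
use strict;
use warnings;

my $header = 'Subject';

# Reconstructed input: the field body starts with folding white space
# on the line after "Subject:", as in the second message.
my $line = "Subject:\n"
         . " =?UTF-8?Q?Re=3a_Karl=e2=80=99s_auto-detect_WAS_Re=3a_tightening_up_?=\n"
         . " =?UTF-8?Q?source_code_encoding_semantics?=\n";

# Original pattern: needs a literal space after the colon, so it does not match.
my ($old) = $line =~ m/^$header: (.+?)\r?\n/is;

# Patched pattern: the space is optional and the first line may be empty.
my ($new) = $line =~ m/^$header: ?(.*?)\r?\n/is;

print "old: ", (defined $old ? "matched" : "no match"), "\n";
print "new: ", (defined $new ? "matched, captured '$new'" : "no match"), "\n";

# Assumption for illustration only: unfold first (drop line breaks that are
# followed by white space), after which even the original pattern matches.
(my $unfolded = $line) =~ s/\r?\n(?=[ \t])//g;
my ($value) = $unfolded =~ m/^$header: (.+?)\r?\n/is;
print "unfolded: $value\n";
```

Note that the patched pattern captures an empty first line for this input, so the continuation lines still have to be appended by the surrounding code.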

johannessen commented 2 years ago

Ah, you beat me to it 😄 great!

I actually like your patch better than mine.

rspier commented 2 years ago

I like your *, so I'm putting that in. :)

rspier commented 2 years ago

I repaired ~530 messages that had broken subjects. There are ~11 more that can't be repaired because what's in the archive isn't a valid email or is otherwise messy.

abh commented 2 years ago

Hi Arne, if you are up for trying to diagnose and fix it, I can get you a dump of the index database and the raw archives for that list.

johannessen commented 2 years ago

Hi Ask Bjørn, I haven’t experienced this issue lately. I think it was fixed with ae3a56d9b26d4642ac3ed52b383bd55d11273535 and abh/colobus#2. Do you have an example of a message for which it still exists?