Closed by johannessen 2 years ago
Fixed. Thanks for the report. And thanks to @rjbs for the nudge.
https://www.nntp.perl.org/group/perl.copenhagen/2015/11.html now has significantly more ø's.
-R
Thanks for looking into this! It seems that a fix for the more serious 2nd example has not yet been deployed:
https://www.nntp.perl.org/group/perl.perl5.porters/2022/02/msg263138.html still has no subject visible, making links to that article (and several others) unclickable in the thread view at the end.
That's a different problem:
> select h_subject from articles where id=263138 and group_id=60;
+-----------+
| h_subject |
+-----------+
|           |
+-----------+
1 row in set (0.086 sec)
I'm guessing the issue is with the header unfolding code here: https://github.com/abh/colobus/blob/master/colobus-archive#L176
Last touched 12 years ago!
I can try and poke at it, time permitting, although I'm tempted to just do:
$subject = "Unknown Subject" unless $subject;
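A minimal sketch of that fallback, for illustration only. The helper name `subject_or_fallback` is hypothetical and not part of Colobus. It differs from the one-liner above in that it checks for a missing or whitespace-only subject rather than plain truthiness, so a subject consisting of the single character `0` would be kept:

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from the Colobus source.
# Returns the subject unchanged unless it is undefined, empty,
# or consists only of white space.
sub subject_or_fallback {
    my ($subject) = @_;
    return $subject if defined $subject && $subject =~ /\S/;
    return 'Unknown Subject';
}

print subject_or_fallback(''),      "\n";   # empty subject gets the fallback
print subject_or_fallback('Re: 0'), "\n";   # normal subject is kept as-is
```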
PR sent. I'll ponder data backfill. There are ~541 cases.
"Unknown subject" would certainly work around the UI navigation issue just fine.
Looking at the Colobus source, I’m pretty sure you’re exactly right about the header unfolding being the cause of this other problem. Specifically, the regex in line 197 fails to recognise header field bodies that begin with folding white space, which is in fact allowed at least since RFC 2822.
I’d expect the following patch to fix it:
- if ($line =~ m/^$header: (.+?)\r?\n/is) {
+ if ($line =~ m/^$header: ?(.*?)\r?\n/is) {
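To illustrate the difference between the two patterns, here is a small standalone sketch. The sample line is made up, and it rests on an assumption I have not verified against the Colobus source: that by the time the regex runs, unfolding has removed the line break together with the folding white space, so that nothing separates the colon from the field body.

```perl
use strict;
use warnings;

my $header = 'Subject';

# Hypothetical header line (assumption: unfolding dropped the line break
# and the leading white space, so no space follows the colon).
my $line = "Subject:=?UTF-8?Q?Re:_example?=\r\n";

# Original pattern: needs a literal space after the colon
# and at least one character of field body.
my ($old) = $line =~ m/^$header: (.+?)\r?\n/is;

# Patched pattern: the space is optional and the body may be empty.
my ($new) = $line =~ m/^$header: ?(.*?)\r?\n/is;

printf "old pattern: %s\n", defined $old ? $old : '(no match)';
printf "new pattern: %s\n", defined $new ? $new : '(no match)';
```

With this input the original pattern does not match at all, while the patched one captures the still-encoded field body, which can then go through the usual RFC 2047 decoding.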
Ah, you beat me to it 😄 great!
I actually like your patch better than mine.
I like your `*`, so I'm putting that in. :)
I repaired ~530 messages that had broken subjects. There are ~11 more that can't be repaired because what's in the archive isn't a valid email or is otherwise messy.
Hi Arne, If you are up for trying to diagnose and fix it I can get you a dump of the index database and the raw archives for that list.
Hi Ask Bjørn, I haven’t experienced this issue lately. I think it was fixed with ae3a56d9b26d4642ac3ed52b383bd55d11273535 and abh/colobus#2. Do you have an example of a message for which it still exists?
Message subjects that are Q-encoded according to RFC 2047 because they contain non-ASCII characters are not decoded correctly. See the following messages as an example:
For the first message, the subject header is shown in undecoded form. However, for the second message, the header is missing entirely. This is a particular problem because the web app’s thread view uses subject headers as text for hyperlinks, which become unusable as a result.
In the source of the above messages as I received them over SMTP, the headers look like this:
In the second case, I initially suspected Thunderbird’s Q-encoding, which uses lowercase hex chars. This is not recommended according to the RFC. However, Encode::MIME::Header (used by Email::MIME, which is in turn used by CN::Model::Article) can actually handle this just fine:
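A check along these lines can be sketched as follows. The encoded string here is a made-up example with lowercase hex digits, not the actual header from either message, and the script calls `Encode` directly rather than going through Email::MIME or CN::Model::Article:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical Q-encoded subject using lowercase hex digits ("=c3=b8").
my $raw = '=?UTF-8?Q?K=c3=b8benhavn?=';

# 'MIME-Header' is the encoding name provided by Encode::MIME::Header.
my $decoded = decode('MIME-Header', $raw);

binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
```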
Also, the first message looks like decoding wasn’t even attempted, so I’m not sure it would make sense to assume that a decoding failure is what causes the header to be empty for the second message.
Don’t know what’s going on here.