libwww-perl / HTML-Parser

The HTML-Parser distribution is a collection of modules that parse and extract information from HTML documents.

Tokenizing bug. Some tokens are split into 2 #27

Open florian-pe opened 2 years ago

florian-pe commented 2 years ago

This problem happens on a particular webpage https://www.radiofrance.fr/franceinter/podcasts

Here is my golfed script, which shows the bug:

#!/usr/bin/perl
package myparser;
use strict;
use warnings;
use v5.10;
use base qw(HTML::Parser);

sub text {
    my ($self, $text, $is_cdata) = @_;
    say "\"$text\"";
}

package main;
use strict;
use warnings;

my $p = myparser->new;
$p->parse_file(shift // exit);

Unfortunately, I can't post a golfed HTML snippet because when I try to reduce the size of the webpage, the bug disappears. So I will have to explain the exact steps I took to reproduce the bug.

In Chromium, go to https://www.radiofrance.fr/franceinter/podcasts. Then load the entire webpage by scrolling to the bottom and clicking on "VOIR PLUS DE PODCASTS" repeatedly until everything is loaded. Then save the webpage.

After that you just have to run the example script with the downloaded page as its argument.

The script prints all the text that is outside of any tag, like this: /tag>TEXT HERE<othertag

THE BUG: some "text elements" are split in two. This happens for several podcast names. "Sur les épaules de Darwin" is one of them.

You can see that the script will output

"Sur les épaules de"
" Darwin"

instead of just "Sur les épaules de Darwin". This also happens to "Sur Les routes de la musique" (just below) and a few others.

Now, I found that when deleting <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">, right at the top of <head></head>, the bug disappears. It also disappears when deleting just ; charset=UTF-8

The problem is that the bug also disappears when I leave the charset as is and delete a bunch of the stuff inside <head></head>, or when I delete a lot of the divs corresponding to the other podcast entries of the index.

This is all the information that I have.

oalders commented 2 years ago

@florian-pe thanks for this. Out of curiosity, do you have the same issue if you change the length of $chunk at https://metacpan.org/release/OALDERS/HTML-Parser-3.78/source/lib/HTML/Parser.pm#L94?
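
For context, parse_file basically just reads the file in fixed-size chunks and hands each chunk to parse(), so a chunk boundary can land anywhere in the document, including inside a run of text. A simplified sketch of the idea (not the actual code, which is at the link above):

use strict;
use warnings;
use HTML::Parser;

# Simplified sketch of what parse_file does: read fixed-size chunks and
# hand each one to parse(). Not the real implementation.
sub parse_file_sketch {
    my ($parser, $file, $chunk_size) = @_;
    $chunk_size //= 512;   # current default chunk length
    open my $fh, "<", $file or return undef;
    my $chunk;
    while (read($fh, $chunk, $chunk_size)) {
        # a chunk boundary can fall in the middle of a tag or a text run
        $parser->parse($chunk) or last;
    }
    close $fh;
    return $parser->eof;
}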

florian-pe commented 2 years ago

@oalders You are right, changing the chunk size does affect the problem. If I set the chunk size to 1024 bytes instead of 512, I still get the bug. But if I set it to 1_000_000, which is larger than the size of the webpage (about 873 KB), then there is no more splitting. At least on the 2 particular elements that I cited above.

oalders commented 2 years ago

@florian-pe what happens if you set unbroken_text(1) on your parser object?

my $p = myparser->new;
$p->unbroken_text(1);
$p->parse_file(shift // exit);

florian-pe commented 2 years ago

@oalders Yes, it fixes the bug. It also produces the exact same output as

my $p = myparser->new;
$p->parse(do { local $/; <> });

I have read the man page for $p->unbroken_text but I don't understand at all what it does. Regardless, are you suggesting that I was misusing the library and that it is in fact "not a bug but a feature"? Which is entirely possible.

oalders commented 2 years ago

Yes @florian-pe, I think for your use case you want this option enabled in the parser. So, it does not appear to me to be a bug.
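
If I read the docs right, with the default settings a single run of text can be reported as several text events when the input arrives in chunks (the parser only promises not to break words or entities), while unbroken_text(1) makes it buffer the whole run and report it in one piece. A small sketch of the difference, not taken from your page, with the chunk boundary picked by hand:

#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
use HTML::Parser;

my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { say "\"$_[0]\"" }, 'dtext' ],
);
# $p->unbroken_text(1);   # enable this to get a single text event

# the boundary between the two chunks falls on the whitespace in the text
$p->parse('<span>Sur les épaules de');
$p->parse(' Darwin</span>');
$p->eof;

# With unbroken_text disabled, the text handler may fire more than once for
# this single run of text; with unbroken_text(1) it should fire exactly once.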

florian-pe commented 2 years ago

@oalders I don't understand how it's not a bug if the problem originates from the subroutine parse_file() not correctly handling buffered input, i.e. it does not deal with the fact that eventually the boundary between 2 consecutively read chunks of bytes will fall in the middle of a token, which apparently is not handled correctly because the end result is that some tokens are split.
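
For example, a start tag split across two parse() calls is reassembled just fine, so I would expect a run of text to be treated the same way. A small sketch of what I mean (mine, not taken from the page):

#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
use HTML::Parser;

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { say "start: $_[0]" },     'tagname' ],
    text_h  => [ sub { say "text:  \"$_[0]\"" }, 'dtext'   ],
);

# the chunk boundaries fall inside the <span> start tag and inside the text
$p->parse('<sp');
$p->parse('an>splitted to');
$p->parse('ken</span>');
$p->eof;

# The start tag is reported once, as "span". Whether the text comes out as
# one event or several is exactly what is in question here.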

oalders commented 2 years ago

I didn't write the code, but just for some history, this sub enters into the codebase in 1996 with a chunk size of 2048:

https://github.com/libwww-perl/HTML-Parser/commit/aeb6d0ba14e680e6#diff-abe42eabebfc8528859aa468da65d562ea1c37c368905ddc25d8b10ad1f801b0R298

Not sure how relevant that is, but it's a fun fact!

I had a closer look at the docs for unbroken_text and as advertised, it does seem that this code should not be splitting tokens even with that option disabled. If you could distill this down to a small test case that demonstrates where tokens are being split, that would be the most helpful way to look at this, I think.

florian-pe commented 2 years ago

Alright, here's a simple example. It is the same golfed script I used to demonstrate the bug, but slightly modified.

#!/usr/bin/perl
package myparser;
use strict;
use warnings;
use v5.10;
use base qw(HTML::Parser);

sub text {
    my ($self, $text, $is_cdata) = @_;
    say "\"$text\"";
}

package main;
use strict;
use warnings;

my $begin = <<'END';
<!DOCTYPE html>
<html>
<head>
</head>
<body>
END

my $end = <<'END';
<span>splitted token</span>
</body>
</html> 
END

my $num = shift // exit;

open my $fh, ">", "page_test.html" or die "cannot open page_test.html: $!";
print $fh $begin;
print $fh "<span>", ("a" x  $num) ,"</span>";
print $fh $end;
close $fh;

my $p = myparser->new;
$p->parse_file("page_test.html");

We can use this one-liner to find the number of "a" characters needed so that the text "splitted token" gets split.

$ perl -E 'for $num (0 .. 2000) { my @out = map { chomp; $_ } qx{ ./golfed.pl $num }; if (!grep {/splitted token/} @out) { say $num; last } }'
434

Then if we run $ ./golfed.pl 434, we will see that the text "splitted token" is indeed split in two.

If we count the number of characters with this

$ perl -E '$file = do { local $/; <> }; $count=1; for (split "", $file) { say "$count\t$_"; $count++ }' page_test.html | less

we see that the 512th character happens to be the last "n" of "splitted token".

I redid that same little experiment, but removing <!DOCTYPE html> from the generated html page, and the bug happens for $num == 450. And again, the 512th character is the same one as in the previous test, i.e. the last "n" / the last character of the string "splitted token".

508     t
509     o
510     k
511     e
512     n
513     <
514     /
515     s
516     p
517     a
518     n

I hope that helps, and that it convinces you that it is indeed a bug.

oalders commented 2 years ago

Thanks for this @florian-pe. That does look like a bug to me. Are you motivated to fix this?

florian-pe commented 2 years ago

That's a nice challenge. I tried to add print statements and it compiles OK. The problem is that, when I try to use the hand-compiled module in a script, even with use lib "/home/user/perl/scripts/parsing_html/New Folder/HTML-Parser_BUILD/blib/lib/";, the script uses the regular CPAN module installed with cpanm instead.

So I cannot even begin to poke around the C code.

florian-pe commented 2 years ago

Ok, never mind. I think this was the problem: https://perldoc.perl.org/XSLoader#LIMITATIONS

I have uninstalled my cpanm version, ran sudo make install, and now I can add print statements.
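
(For the record, running the script against the build directory with the blib pragma might also work, e.g. $ perl -Mblib="/home/user/perl/scripts/parsing_html/New Folder/HTML-Parser_BUILD" ./golfed.pl page_test.html, since blib is supposed to add both blib/lib and blib/arch, where the compiled XS ends up, to @INC. But reinstalling was simpler.)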