Open florian-pe opened 2 years ago
@florian-pe thanks for this. Out of curiosity, do you have the same issue if you change the length of $chunk
at https://metacpan.org/release/OALDERS/HTML-Parser-3.78/source/lib/HTML/Parser.pm#L94?
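(For readers following along: at that line, parse_file consumes the file in fixed-size chunks and hands each one to $self->parse. The sketch below is a paraphrase of that loop, not the actual source; it uses a stand-in parse sub and an in-memory filehandle so it runs without HTML::Parser installed. The 512-byte size is the one under discussion.)

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Stand-in for $self->parse($chunk): just record how big each chunk was.
my @sizes;
sub parse { push @sizes, length $_[0]; 1 }

my $html = "x" x 1300;                 # pretend this is the file contents
open my $fh, "<", \$html or die $!;    # in-memory filehandle
my $chunk;
while (read($fh, $chunk, 512)) {       # same fixed chunk size as parse_file
    parse($chunk) or last;
}
close $fh;
say "@sizes";    # 512 512 276
```

Any token that straddles one of those 512-byte boundaries has to be stitched back together by the parser's own buffering, which is exactly what this issue is about.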
@oalders You are right, it does fix the problem. If I set the chunk size to 1024 bytes instead of 512, I still get the bug. But if I set it to 1_000_000, which is larger than the size of the webpage (about 873KB), then there is no more splitting. At least on the 2 particular elements that I cited above.
@florian-pe what happens if you set unbroken_text(1)
on your parser object?
my $p = myparser->new;
$p->unbroken_text(1);
$p->parse_file(shift // exit);
@oalders Yes it fixes the bug. It also produces the exact same output as
my $p = myparser->new;
$p->parse(do { local $/; <> });
I have read the man page for $p->unbroken_text
but I don't understand at all what it does.
But regardless, are you suggesting that I was misusing the library and that it is in fact "not a bug but a feature"?
Which is entirely possible.
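For what it's worth, here is a minimal sketch of what unbroken_text changes, using the event-driven API. It assumes the documented behaviour: with the option off, text that straddles the chunks fed to parse() may arrive as several text events; with it on, the parser accumulates text and fires the handler once per uninterrupted run. (Requires HTML::Parser; the exact event contents with the option off are not guaranteed, so none are asserted here.)

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use HTML::Parser;

for my $flag (0, 1) {
    my @texts;
    my $p = HTML::Parser->new(
        api_version => 3,
        text_h      => [ sub { push @texts, $_[0] }, 'text' ],
    );
    $p->unbroken_text($flag);
    # Feed the element in two chunks, with the boundary inside the text:
    $p->parse("<p>splitted to");
    $p->parse("ken</p>");
    $p->eof;
    say "unbroken_text=$flag -> ", scalar(@texts), " text event(s): @texts";
}
```

With the flag off I would expect more than one text event here; with it on, a single "splitted token" event.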
Yes @florian-pe, I think for your use case you want this option enabled in the parser. So, it does not appear to me to be a bug.
@oalders I don't understand how it's not a bug if the problem originates from the subroutine parse_file()
not correctly handling buffered input. I.e. it does not deal with the fact that eventually the boundary between 2 consecutively read chunks of bytes will fall in the middle of a token, which apparently is not handled correctly, because the end result is that some tokens are being split.
I didn't write the code, but just for some history: this sub entered the codebase in 1996 with a chunk size of 2048.
Not sure how relevant that is, but it's a fun fact!
I had a closer look at the docs for unbroken_text
and as advertised, it does seem that this code should not be splitting tokens even with that option disabled. If you could distill this down to a small test case that demonstrates where tokens are being split, that would be the most helpful way to look at this, I think.
Alright, here's a simple example: the same golfed script I used to demonstrate the bug, slightly modified.
#!/usr/bin/perl
package myparser;
use strict;
use warnings;
use v5.10;
use base qw(HTML::Parser);
sub text {
my ($self, $text, $is_cdata) = @_;
say "\"$text\"";
}
package main;
use strict;
use warnings;
my $begin = <<'END';
<!DOCTYPE html>
<html>
<head>
</head>
<body>
END
my $end = <<'END';
<span>splitted token</span>
</body>
</html>
END
my $num = shift // exit;
open my $fh, ">", "page_test.html" or die "can't open page_test.html: $!";
print $fh $begin;
print $fh "<span>", ("a" x $num) ,"</span>";
print $fh $end;
close $fh;
my $p = myparser->new;
$p->parse_file("page_test.html");
We can use this one-liner to find the number of "a" characters needed so that the token "splitted token" will be split.
$ perl -E 'for $num (0 .. 2000) { my @out = map { chomp; $_ } qx{ ./golfed.pl $num }; if (!grep {/splitted token/} @out) { say $num; last } }'
434
Then if we run $ ./golfed.pl 434
we will see that the token "splitted token" is indeed split in 2.
If we count the characters with this
$ perl -E '$file = do { local $/; <> }; $count=1; for (split "", $file) { say "$count\t$_"; $count++ }' page_test.html | less
we see that the 512th character happens to be the last "n" of "splitted token".
I redid that same little experiment, but removing <!DOCTYPE html>
from the generated HTML page, and the bug happens for $num == 450.
And again, the 512th character is the same one as in the previous test, i.e. the last "n", the last character of the string "splitted token".
508 t
509 o
510 k
511 e
512 n
513 <
514 /
515 s
516 p
517 a
518 n
I hope that helps, and that it convinces you that it is indeed a bug.
Thanks for this @florian-pe. That does look like a bug to me. Are you motivated to fix this?
That's a nice challenge. I tried to add print statements and it compiles OK. The problem is that, when I try to use the hand-compiled module in a script, even with
use lib "/home/user/perl/scripts/parsing_html/New Folder/HTML-Parser_BUILD/blib/lib/";
the script uses the normal CPAN module installed with cpanm instead.
So I cannot even begin to poke around in the C code.
Ok, never mind. I think this was the problem:
https://perldoc.perl.org/XSLoader#LIMITATIONS
I have uninstalled my cpanm version, did sudo make install, and now I can add print statements.
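An aside that may save the reinstall next time: use lib pointing at blib/lib alone misses blib/arch, which is where the freshly built shared object lives, so perl can still end up loading the installed copy's XS code. The core blib pragma adds both directories to @INC. Something like the following (using the build path from the earlier comment; untested here) should make a script pick up the hand-compiled module:

```
$ perl "-Mblib=/home/user/perl/scripts/parsing_html/New Folder/HTML-Parser_BUILD" ./golfed.pl page_test.html
```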
This problem happens on a particular webpage: https://www.radiofrance.fr/franceinter/podcasts
This is my golfed script which shows the bug.
Unfortunately, I can't post a golfed HTML snippet, because when I try to reduce the size of the webpage the bug disappears. So I will have to explain the exact steps I took to reproduce the bug.
In Chromium, go to https://www.radiofrance.fr/franceinter/podcasts. Then load the entire webpage by going to the bottom and clicking on "VOIR PLUS DE PODCASTS" repeatedly until everything is loaded. Then save the webpage.
After that, you just have to execute the example script with the downloaded page as argument.
The script prints all the text which is outside of any tag, like this:
/tag>TEXT HERE<othertag
THE BUG: Some "text elements" are split in 2. This happens for several podcast names.
"Sur les épaules de Darwin"
is one of those. You can see that the script will output the name split in 2 instead of just
"Sur les épaules de Darwin"
This also happens to "Sur Les routes de la musique"
(just below) and a few others.
Now, I found that when deleting
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
just at the top of <head></head>, the bug disappears. It also disappears when deleting just ; charset=UTF-8
The problem is that the bug also disappears when I leave the charset as is and delete a bunch of the stuff inside
<head></head>
or delete a lot of the divs corresponding to the other podcast entries of the index.
This is all the information that I have.