libwww-perl / HTML-Parser

The HTML-Parser distribution is is a collection of modules that parse and extract information from HTML documents.
Other
6 stars 13 forks source link

Edge case with trailing text unreported #23

Open jackdeguest opened 2 years ago

jackdeguest commented 2 years ago

With HTML::Parser v3.76.

Consider the following chunk of data:

Hello world! <span class="highlight">Isn't this wonderful</span> really?

Creating an object, such as:

# using curry and assuming this runs within a module that acts as a simple wrapper, nothing fancy.
my $p = HTML::Parser->new(
        api_version => 3, 
        start_h => [ $self->curry::add_start, 'self, tagname, attr, attrseq, text, column, line, offset, offset_end'],
        end_h   => [ $self->curry::add_end,   'self, tagname, attr, attrseq, text, column, line, offset, offset_end' ],
        marked_sections => 1,
        comment_h => [ $self->curry::add_comment, 'self, text, column, line, offset, offset_end'],
        declaration_h => [ $self->curry::add_declaration, 'self, text, column, line, offset, offset_end'],
        default_h => [ $self->curry::add_default, 'self, tagname, attr, attrseq, text, column, line, offset, offset_end'],
        text_h => [ $self->curry::add_text, 'self, text, column, line, offset, offset_end'],
        empty_element_tags => 1,
        end_document_h => [ $self->curry::end_document, 'self, skipped_text'],
);
$p->parse( $html );
sub add_text
{
    my $self = shift( @_ );
    print( "got '", $_[1], "'\n" );
}

And this would yield:

got 'Hello world! '
got 'Isn't this wonderful'

However, ' really?' is not being reported. One has to explicitly call $p->eof to have the trailing text reported. If this is an intended feature, then it ought to be made clear in the documentation. However, I think one should not have to call eof to get that last trailing text.

oalders commented 2 years ago

The docs say:

Calling the $p->eof method outside a handler callback will flush any remaining buffered text (which triggers the text event if there is any remaining text).

I think that seems reasonable in your case, where the fragment is not wholly enclosed by tags. I had to go back and check one of my projects and I see that I'm calling eof here: https://metacpan.org/release/OALDERS/HTML-Restrict-v3.0.0/source/lib/HTML/Restrict.pm#L317 (likely for similar reasons).

eof is in the SYNOPSIS as well, having said that, if you'd like to suggest an improvement to the documentation, that would be welcome. I think it could be clearer.

jackdeguest commented 2 years ago

The docs say:

Calling the $p->eof method outside a handler callback will flush any remaining buffered text (which triggers the text event if there is any remaining text).

I think that seems reasonable in your case, where the fragment is not wholly enclosed by tags. I had to go back and check one of my projects and I see that I'm calling eof here: https://metacpan.org/release/OALDERS/HTML-Restrict-v3.0.0/source/lib/HTML/Restrict.pm#L317 (likely for similar reasons).

eof is in the SYNOPSIS as well, having said that, if you'd like to suggest an improvement to the documentation, that would be welcome. I think it could be clearer.

I think the doc makes it clear for sure, but expectations might be different. If one calls parse on a string, and the parser reaches the end of that string, just like if it reaches the end of a file, wouldn't it be reasonable to assume the buffer would need to be flushed? Maybe there are good reason not to do so?

oalders commented 2 years ago

Maybe there are good reason not to do so?

Maybe some digging through the original commits would make this clear? I wrote none of this code, so I actually just don't know.