libwww-perl / HTML-Parser

The HTML-Parser distribution is a collection of modules that parse and extract information from HTML documents.

NAME

HTML::Parser - HTML parser class

SYNOPSIS

use strict;
use warnings;
use HTML::Parser ();

# Create parser object
my $p = HTML::Parser->new(
    api_version     => 3,
    start_h         => [\&start, "tagname, attr"],
    end_h           => [\&end,   "tagname"],
    marked_sections => 1,
);

# Parse document text chunk by chunk
$p->parse($chunk1);
$p->parse($chunk2);

# ...
# signal end of document
$p->eof;

# Parse directly from file
$p->parse_file("foo.html");

# or
open(my $fh, "<:utf8", "foo.html") || die "Can't open foo.html: $!";
$p->parse_file($fh);

DESCRIPTION

Objects of the HTML::Parser class will recognize markup and separate it from plain text (alias data content) in HTML documents. As different kinds of markup and text are recognized, the corresponding event handlers are invoked.
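The SYNOPSIS refers to start and end handlers without showing their bodies. The following is a minimal sketch of what such handlers might look like. The names on_start, on_end and @events are hypothetical and chosen only for illustration; each handler receives the values listed in its argspec, in that order.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @events;    # event descriptions, in document order

# Hypothetical handler: receives the values named by "tagname, attr".
sub on_start {
    my ($tagname, $attr) = @_;
    my $attrs = join ", ", map { "$_=$attr->{$_}" } sort keys %$attr;
    push @events, "start $tagname [$attrs]";
}

# Hypothetical handler: receives the value named by "tagname".
sub on_end {
    my ($tagname) = @_;
    push @events, "end $tagname";
}

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&on_start, "tagname, attr"],
    end_h       => [\&on_end,   "tagname"],
);
$p->parse('<a href="x.html">link</a>');
$p->eof;

print "$_\n" for @events;
```

No text handler is registered here, so the text "link" is recognized but not reported to any callback.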

HTML::Parser is not a generic SGML parser. We have tried to make it able to deal with the HTML that is actually "out there", and it normally parses as closely as possible to the way the popular web browsers do it instead of strictly following one of the many HTML specifications from W3C. Where there is disagreement, there is often an option that you can enable to get the official behaviour.

The document to be parsed may be supplied in arbitrary chunks. This makes it possible to parse documents on the fly as they are received from the network.
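As a rough illustration of chunked input, the sketch below feeds a document in three pieces, with the chunk boundaries deliberately placed in the middle of tags. The chunk contents and the @seen array are made up for this example; in a real program the chunks would come from a socket or similar source.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @seen;    # one entry per start tag reported

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            push @seen, "$tagname class=" . ($attr->{class} // "");
        },
        "tagname, attr"
    ],
);

# Hypothetical chunks; the first one ends in the middle of a start tag.
my @chunks = ('<html><body><p cla', 'ss="intro">Hello</p></bo', 'dy></html>');
$p->parse($_) for @chunks;
$p->eof;

print "$_\n" for @seen;
```

The parser holds back the incomplete "<p cla" until the next chunk completes the tag, so the handler sees the p tag once, with its class attribute intact.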

If event driven parsing does not feel right for your application, you might want to use HTML::PullParser. This is an HTML::Parser subclass that allows a more conventional program structure.
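For comparison, here is a small sketch of the pull style, assuming the HTML::PullParser interface as described in its own documentation (a doc option taking a reference to the document, one argspec per event type, and get_token returning an array reference or undef at the end). The sample document is invented for the example.

```perl
use strict;
use warnings;
use HTML::PullParser ();

my $html = '<h1>Title</h1><p>Body</p>';

my $p = HTML::PullParser->new(
    doc   => \$html,
    start => 'event, tagname',
    end   => 'event, tagname',
    text  => 'event, dtext',
) || die "Cannot create parser: $!";

# The program asks for the next token instead of being called back.
my @tokens;
while (my $token = $p->get_token) {
    my ($event, $value) = @$token;
    push @tokens, "$event:$value";
}

print join(" ", @tokens), "\n";
```

The loop keeps control of the program flow, which can be easier to follow than a set of callbacks when the processing depends on what was seen earlier.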

METHODS

The following method is used to construct a new HTML::Parser object:

The following methods feed the HTML document to the HTML::Parser object:

Most parser options are controlled by boolean attributes. Each boolean attribute is enabled by calling the corresponding method with a TRUE argument and disabled with a FALSE argument. The attribute value is left unchanged if no argument is given. The return value from each method is the old attribute value.
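The get/set convention described above can be sketched with one boolean option. This example uses unbroken_text, which is assumed here to be off by default; the variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $p = HTML::Parser->new(api_version => 3);

my $old  = $p->unbroken_text(1);    # enable; returns the previous value
my $now  = $p->unbroken_text;       # no argument: query without changing
my $prev = $p->unbroken_text(0);    # disable; returns the value just replaced

printf "old=%d now=%d prev=%d\n", map { $_ ? 1 : 0 } $old, $now, $prev;
```

Because each call returns the old value, a caller can save it, change the option temporarily, and restore it afterwards.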

Methods that can be used to get and/or set parser options are:

As markup and text are recognized, handlers are invoked. The following method is used to set up handlers for different events:

Filters based on tags can be set up to limit the number of events reported. The main bottleneck during parsing is often the huge number of callbacks made from the parser. Applying filters can improve performance significantly.

The following methods control filters:

Internally, the system has two filter lists, one for report_tags and one for ignore_tags, and both filters are applied. This effectively gives ignore_tags precedence over report_tags.

Examples:

$p->ignore_tags(qw(style));
$p->report_tags(qw(script style));

results in only script events being reported.
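A small runnable sketch of the same filter combination follows. The sample markup and the @started array are invented for the example; only start events are collected.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @started;    # tag names of the start events that get through

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [sub { push @started, shift }, "tagname"],
);

# Same filters as in the example above.
$p->ignore_tags(qw(style));
$p->report_tags(qw(script style));

$p->parse('<script>var x;</script><style>p {}</style><p>text</p>');
$p->eof;

print "@started\n";
```

The p tag is dropped because it is not in report_tags, and the style tag is dropped because ignore_tags is applied as well, leaving only script.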

Argspec

Argspec is a string containing a comma-separated list that describes the information reported by the event. The following argspec identifier names can be used:

The whole argspec string can be wrapped up in '@{...}' to signal that the resulting event array should be flattened. This only makes a difference if an array reference is used as the handler target. Consider this example:

$p->handler(text => [], 'text');
$p->handler(text => [], '@{text}');

Given two text events, "foo" and "bar", the first example will end up with [["foo"], ["bar"]] in the handler target array and the second with ["foo", "bar"].
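The difference can be checked with a short sketch that uses two parsers, one per argspec form. The arrays @nested and @flat and the sample markup are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

my (@nested, @flat);    # handler target arrays

my $p1 = HTML::Parser->new(api_version => 3);
$p1->handler(text => \@nested, 'text');       # one array ref per event

my $p2 = HTML::Parser->new(api_version => 3);
$p2->handler(text => \@flat, '@{text}');      # values pushed directly

for my $p ($p1, $p2) {
    $p->parse('<p>foo</p><p>bar</p>');
    $p->eof;
}

print scalar(@nested), " nested entries, first is a ", ref($nested[0]), " ref\n";
print "flat: @flat\n";
```

The flattened form is convenient when each event contributes a single value; the nested form keeps the values of one event grouped together when the argspec lists several.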

Events

Handlers for the following events can be registered:

Unicode

If Unicode is passed to $p->parse() then chunks of Unicode will be reported to the handlers. The offset and length argspecs will also report their position in terms of characters.

It is safe to parse raw undecoded UTF-8 if you either avoid decoding entities and make sure to not use argspecs that do, or enable the utf8_mode for the parser. Parsing of undecoded UTF-8 might be useful when parsing from a file where you need the reported offsets and lengths to match the byte offsets in the file.

If a filename is passed to $p->parse_file() then the file will be read in binary mode. This will be fine if the file contains only ASCII or Latin-1 characters. If the file contains UTF-8 encoded text then care must be taken when decoding entities as described in the previous paragraph, but it is better to open the file with the UTF-8 layer so that it is decoded properly:

open(my $fh, "<:utf8", "index.html") || die "...: $!";
$p->parse_file($fh);

If the file contains text encoded in a charset besides ASCII, Latin-1 or UTF-8 then decoding will always be needed.
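One way to do that decoding is with the core Encode module before handing the text to the parser. The sketch below assumes the input is ISO-8859-2; the sample bytes are invented, with byte 0xB3 standing for U+0142 in that charset.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser ();

# Hypothetical input: raw ISO-8859-2 bytes, as they might be read from a file.
my $bytes = "<p>z\xB3oty</p>";

my $text = '';
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [sub { $text .= shift }, "dtext"],
);

# Decode to Perl characters first, then parse.
$p->parse(decode("iso-8859-2", $bytes));
$p->eof;

binmode(STDOUT, ":encoding(UTF-8)");
print "$text\n";
```

When reading from a file, opening it with a layer such as "<:encoding(iso-8859-2)" and passing the handle to parse_file() should have the same effect.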

VERSION 2 COMPATIBILITY

When an HTML::Parser object is constructed with no arguments, a set of handlers is automatically provided that is compatible with the old HTML::Parser version 2 callback methods.

This is equivalent to the following method calls:

$p->handler(start   => "start",   "self, tagname, attr, attrseq, text");
$p->handler(end     => "end",     "self, tagname, text");
$p->handler(text    => "text",    "self, text, is_cdata");
$p->handler(process => "process", "self, token0, text");
$p->handler(
    comment => sub {
        my ($self, $tokens) = @_;
        for (@$tokens) { $self->comment($_); }
    },
    "self, tokens"
);
$p->handler(
    declaration => sub {
        my $self = shift;
        $self->declaration(substr($_[0], 2, -1));
    },
    "self, text"
);

Setting up these handlers can also be requested with the "api_version => 2" constructor option.

SUBCLASSING

The HTML::Parser class can be subclassed. Parser objects are plain hashes and HTML::Parser reserves only hash keys that start with "_hparser". The parser state can be set up by invoking the init() method, which takes the same arguments as new().
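A minimal subclass sketch follows. The class name TagCounter, its tag_count key and its methods are hypothetical; the key was chosen so that it does not start with "_hparser".

```perl
package TagCounter;
use strict;
use warnings;
use parent 'HTML::Parser';

sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(
        api_version => 3,
        start_h     => [\&_count, "self, tagname"],
        %args,
    );
    $self->{tag_count} = {};    # subclass state lives in the object hash
    return $self;
}

# Start handler: called with the parser object and the tag name.
sub _count {
    my ($self, $tagname) = @_;
    $self->{tag_count}{$tagname}++;
}

sub tag_count { $_[0]{tag_count} }

package main;

my $p = TagCounter->new;
$p->parse('<p><b>x</b></p><p>y</p>');
$p->eof;

my $count = $p->tag_count;
print "$_: $count->{$_}\n" for sort keys %$count;
```

Passing "self" first in the argspec gives the handler access to the object, so per-parser state does not have to live in global variables.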

EXAMPLES

The first simple example shows how you might strip out comments from an HTML document. We achieve this by setting up a comment handler that does nothing and a default handler that will print out anything else:

use HTML::Parser ();
HTML::Parser->new(
    default_h => [sub { print shift }, 'text'],
    comment_h => [""],
)->parse_file(shift || die)
    || die $!;

An alternative implementation is:

use HTML::Parser ();
HTML::Parser->new(
    end_document_h => [sub { print shift }, 'skipped_text'],
    comment_h      => [""],
)->parse_file(shift || die)
    || die $!;

This will in most cases be much more efficient since only a single callback will be made.

The next example prints out the text that is inside the <title> element of an HTML document. Here we start by setting up a start handler. When it sees the title start tag it enables a text handler that prints any text found and an end handler that will terminate parsing as soon as the title end tag is seen:

use HTML::Parser ();

sub start_handler {
    return if shift ne "title";
    my $self = shift;
    $self->handler(text => sub { print shift }, "dtext");
    $self->handler(
        end => sub {
            shift->eof if shift eq "title";
        },
        "tagname,self"
    );
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler(start => \&start_handler, "tagname,self");
$p->parse_file(shift || die) || die $!;
print "\n";

More examples are found in the eg/ directory of the HTML-Parser distribution: the program hrefsub shows how you can edit all links found in a document; the program htextsub shows how to edit the text only; the program hstrip shows how you can strip out certain tags/elements and/or attributes; and the program htext shows how to obtain the plain text, but not any script/style content.

You can browse the eg/ directory online from the [Browse] link on the http://search.cpan.org/~gaas/HTML-Parser/ page.
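As a rough idea of the plain-text case, the sketch below collects decoded text while skipping script and style elements. It is not the source of the htext program, only an illustration written for this section; the sample markup and variable names are made up.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @parts;    # decoded text fragments, in document order

my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [sub { push @parts, shift }, "dtext"],
);

# Suppress everything between these start and end tags, tags included.
$p->ignore_elements(qw(script style));

$p->parse('<p>Hello <b>world</b> &amp; all</p>'
        . '<script>var x = 1;</script><style>p { color: red }</style>');
$p->eof;

my $plain = join "", @parts;
print "$plain\n";
```

Using the dtext argspec means entities such as &amp;amp; arrive already decoded, so the collected text needs no further unescaping.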

BUGS

The <style> and <script> sections do not end with the first "</", but need the complete corresponding end tag. The standard behaviour is not really practical.

When the strict_comment option is enabled, we still recognize comments where there is something other than whitespace between even and odd "--" markers.

Once $p->boolean_attribute_value has been set, there is no way to restore the default behaviour.

There is currently no way to get both quote characters into the same literal argspec.

Empty tags, e.g. "<>" and "</>", are not recognized. SGML allows them to repeat the previous start tag or close the previous start tag respectively.

NET tags, e.g. "<code/.../" are not recognized. This is SGML shorthand for "<code>...</code>".

Incomplete start or end tags, e.g. "<tt<b>...</b</tt>" are not recognized.

DIAGNOSTICS

The following messages may be produced by HTML::Parser. The notation in this listing is the same as used in perldiag:

SEE ALSO

HTML::Entities, HTML::PullParser, HTML::TokeParser, HTML::HeadParser, HTML::LinkExtor, HTML::Form

HTML::TreeBuilder (part of the HTML-Tree distribution)

http://www.w3.org/TR/html4/

More information about marked sections and processing instructions may be found at http://www.is-thought.co.uk/book/sgml-8.htm.

COPYRIGHT

Copyright 1996-2016 Gisle Aas. All rights reserved.
Copyright 1999-2000 Michael A. Chase.  All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.