libwww-perl / HTML-Parser

The HTML-Parser distribution is is a collection of modules that parse and extract information from HTML documents.
Other
6 stars 13 forks source link

Attributes that have no value get their name as their value #17

Open simbabque opened 4 years ago

simbabque commented 4 years ago

When investigating libwww-perl/WWW-Mechanize#125 I noticed that the following HTML parses weirdly.

<input type="hidden" name="foo" value>

According to the HTML spec on an input element a value attribute that's not followed by an equals = should be empty, so we should be parsing it to an empty string.

Empty attribute syntax Just the attribute name. The value is implicitly the empty string.

Instead of making it empty, we set it to "value".

I've looked into it, and got as far as that get_tag returns a data structure that contains the wrong value:

\ [
    [0] "input",
    [1] {
        /       "/",
        name    "foo",
        type    "hidden",
        value   "value"
    },
    [2] [
        [0] "type",
        [1] "name",
        [2] "value",
        [3] "/"
    ],
    [3] "<input type="hidden" name="foo" value />"
]

Unfortunately I am out of my depths with the actual C code for the parser. But I think, we should be returning an empty string for the value attribute, as well as all other empty attributes.


I wrote the following test to demonstrates the problem.

use strict;
use warnings;

use HTML::TokeParser ();
use Test::More;
use Data::Dumper;

ok(
    !get_tag(q{})->{value},
    'No value when there was no value'
);    # key does not exist

{
    # this fails because value is 'value'
    my $t = get_tag(q{value});
    ok(
        !$t->{value},
        'No value when value attr has no value'
    ) or diag Dumper $t;    
}

ok(
    !get_tag(q{value=""})->{value},
    'No value when value attr is an empty string'
);    # key is an empty string

is(
    get_tag(q{value="bar"})->{value}, 
    'bar', 
    'Value is bar'
);    # this obviously works

sub get_tag {
    my $attr = shift;
    return HTML::TokeParser->new(\qq{<input type="hidden" name="foo" $attr />})->get_tag->[1];
}

done_testing;
oalders commented 4 years ago

Thanks for the test case @simbabque!

oalders commented 4 years ago

It might be helpful to add this as a TODO test as well.

andyjack commented 3 years ago

I've also run into this issue a couple of times - digging deeper into HTML::Parser reveals that setting the value to the name is intentional! There is an option to control what value is "parsed" when an attribute has no value.

; perl -MHTML::TokeParser -MDDP -lE 'my $p = HTML::TokeParser->new( doc => \qq{<input type="text" name="abc123" value>}, boolean_attribute_value=>"no value!")->get_tag->[1]; p $p'
{
    name    "abc123",
    type    "text",
    value   "no value!"
}

The current design of "return the name" doesn't seem sensible to me - having the default setting for the option be undef or q{} would align the module with the HTML spec, but that would probably affect downstream code.

Here's where setting the value to the name happens in the C code.

wolfsage commented 3 years ago

This part of the parser specifically mentions 'boolean' - I believe it's referring to this:

https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#boolean-attributes

If the attribute is present, its value must either be the empty string or a value that is an ASCII case-insensitive match for the attribute's canonical name, with no leading or trailing whitespace.

and

Example:

Here is an example of a checkbox that is checked and disabled. The checked and disabled attributes are the boolean attributes.

<label><input type=checkbox checked name=cheese disabled> Cheese</label>

This could be equivalently written as this:

<label><input type=checkbox checked=checked name=cheese disabled=disabled> Cheese</label>

I think what this means is that HTML::Parser needs to be aware of the types of the attributes its parsing, which makes it seem like the fix won't be so easy?

oalders commented 3 years ago

This blog post states that there are 25 attributes which are boolean. https://meiert.com/en/blog/boolean-attributes-of-html/

If that's correct, they could be special-cased, but from my quick digging I didn't find a definitive list elsewhere, so I'm not confident in this yet.

andyjack commented 3 years ago

Thanks for the info about boolean attributes - the "return the name" behavior makes sense now.

Does a user of HTML::Parser care about differentiating between <... checked ...>, <... checked="checked" ...> and <... checked="" ...> being parsed? This might affect the fix for the issue.