Open simbabque opened 4 years ago
Thanks for the test case @simbabque!
It might be helpful to add this as a TODO test as well.
I've also run into this issue a couple of times - digging deeper into HTML::Parser reveals that setting the value to the name is intentional! There is an option to control what value is "parsed" when an attribute has no value.
; perl -MHTML::TokeParser -MDDP -lE 'my $p = HTML::TokeParser->new( doc => \qq{<input type="text" name="abc123" value>}, boolean_attribute_value=>"no value!")->get_tag->[1]; p $p'
{
name "abc123",
type "text",
value "no value!"
}
The current design of "return the name" doesn't seem sensible to me - having the default setting for the option be undef
or q{}
would align the module with the HTML spec, but that would probably affect downstream code.
Here's where setting the value to the name happens in the C code.
This part of the parser specifically mentions 'boolean' - I believe it's referring to this:
https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#boolean-attributes
If the attribute is present, its value must either be the empty string or a value that is an ASCII case-insensitive match for the attribute's canonical name, with no leading or trailing whitespace.
and
Example:
Here is an example of a checkbox that is checked and disabled. The checked and disabled attributes are the boolean attributes.
<label><input type=checkbox checked name=cheese disabled> Cheese</label>
This could be equivalently written as this:
<label><input type=checkbox checked=checked name=cheese disabled=disabled> Cheese</label>
I think what this means is that HTML::Parser needs to be aware of the types of the attributes its parsing, which makes it seem like the fix won't be so easy?
This blog post states that there are 25 attributes which are boolean. https://meiert.com/en/blog/boolean-attributes-of-html/
If that's correct, they could be special-cased, but from my quick digging I didn't find a definitive list elsewhere, so I'm not confident in this yet.
Thanks for the info about boolean attributes - the "return the name" behavior makes sense now.
Does a user of HTML::Parser care about differentiating between <... checked ...>
, <... checked="checked" ...>
and <... checked="" ...>
being parsed? This might affect the fix for the issue.
When investigating libwww-perl/WWW-Mechanize#125 I noticed that the following HTML parses weirdly.
According to the HTML spec on an input element a value attribute that's not followed by an equals
=
should be empty, so we should be parsing it to an empty string.Instead of making it empty, we set it to "value".
I've looked into it, and got as far as that get_tag returns a data structure that contains the wrong value:
Unfortunately I am out of my depths with the actual C code for the parser. But I think, we should be returning an empty string for the value attribute, as well as all other empty attributes.
I wrote the following test to demonstrates the problem.