validator / htmlparser

The Validator.nu HTML parser https://about.validator.nu/htmlparser/
Other
56 stars 26 forks source link

Fix handling of "ambiguous ampersands" to report error when expected #83

Closed sideshowbarker closed 1 year ago

sideshowbarker commented 1 year ago

Fixes https://github.com/validator/htmlparser/issues/82. 44427c0df60ec899c52b149bba03e1abffad2f5e broke handling of "ambiguous ampersands" for the case where we've already examined enough of string after the ampersand to conclude that the substring we have so far doesn't match the beginning of any known named character reference.

This change re-conforms the error-reporting for that case to the requirements in the HTML spec. Otherwise, without this change, no error is reported as expected in many or most cases where an ampersand doesn't actually start a character reference.

dill0wn commented 1 year ago

This change started triggering a bunch of new errors for us when checking against https://validator.w3.org/nu/.

The lines are all lines similar to this:

<a href="https://foo.example.com?query=q&sort=-1">An Hyperlink!</a>

The"&" in the href's query string triggers this syntax violation.

I just would like to double check that this is bad practice and something we should fix in our code. Because it seems pretty innocuous.

nschonni commented 1 year ago

@dill0wn that looks like a regression to me. I don't think it should be flagging the query string parameters

sideshowbarker commented 1 year ago

@dill0wn that looks like a regression to me. I don't think it should be flagging the query string parameters

Yeah, it‘s a regression. See also https://github.com/validator/validator/issues/1595

sideshowbarker commented 1 year ago

@dill0wn

I just would like to double check that this is bad practice and something we should fix in our code. Because it seems pretty innocuous.

Per the current spec, it’s not bad practice and not something you need to fix in your code in order for it to be valid.

We changed the parser algorithm in the spec years ago to allow for ampersands in URLs in attribute values, while still disallowing it in other cases where we wanted to disallow it — and then we attempted to change the parser to match the behavior required by the spec. It seems now we got the allow-in-attribute-values behavior done right in the parser, but as a result, somewhere along the lines, we seem to have ended up allowing ampersands everywhere. So we still need to fix that. But in the meantime, I need to revert this change.

sideshowbarker commented 1 year ago

I’ve fixed this (reverted the patch in the PR) in the HTML checker sources and pushed the fix to https://validator.w3.org/nu/