validator / htmlparser

The Validator.nu HTML parser https://about.validator.nu/htmlparser/

Ambiguous ampersands are not detected #82

Open ezequiel-garzon opened 1 year ago

ezequiel-garzon commented 1 year ago

According to the HTML standard, "an ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more ASCII alphanumerics, followed by a U+003B SEMICOLON character (;), where these characters do not match any of the names given in the named character references section".

Maybe I'm missing something, but shouldn't something like &ThisAmpersandShouldBeDeemedAmbiguous; then raise an error or a warning? I know it used to, but I've checked both on a Mac with version 23.4.11 and on https://validator.w3.org/, and no error or warning is raised now. Thanks in advance.
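
To make the quoted definition concrete, here is a minimal Python sketch of the textual check it describes. It is an illustration only, not how the Validator.nu parser works: the function name `find_ambiguous_ampersands` is hypothetical, the list of known names is taken from Python's standard `html.entities.html5` table, and the sketch ignores tokenizer details such as prefix matching of legacy references that lack a semicolon.

```python
import re
from html.entities import html5  # keys are reference names, most ending in ";"

# "&", then one or more ASCII alphanumerics, then ";"
_CANDIDATE = re.compile(r"&([A-Za-z0-9]+);")


def find_ambiguous_ampersands(text):
    """Return candidates such as "&foo;" whose name is not a known named reference.

    Hypothetical helper: it applies only the textual definition quoted above.
    """
    return [
        match.group(0)
        for match in _CANDIDATE.finditer(text)
        if match.group(1) + ";" not in html5
    ]
```

Under these assumptions, a call such as `find_ambiguous_ampersands("a &amp; b &ThisAmpersandShouldBeDeemedAmbiguous;")` would flag only the second reference, since `amp;` is a known name. Numeric references such as `&#123;` are never candidates because `#` is not an ASCII alphanumeric.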

sideshowbarker commented 1 year ago

@ezequiel-garzon Thanks much for catching it. It was a great catch because, embarrassingly, it appears we’ve had this bug for almost two years now. The effect is that for all that time, the HTML checker hasn’t been reporting errors for almost all cases of invalid named character references.

In other words, when people have accidentally made minor spelling mistakes in otherwise-valid named character references, the HTML checker hasn’t been catching and reporting them, so people haven’t had the chance to fix those mistakes.

I’ve fixed this in a feature branch with https://github.com/validator/htmlparser/pull/83. For now, I’ve switched the HTML checker to being built from that branch and pushed the updates to https://validator.w3.org/nu/

But I’ll keep this issue open until the fix gets merged into the main branch of the HTML parser code here.

ezequiel-garzon commented 1 year ago

My pleasure, @sideshowbarker. Thank you for taking care of this and so many other projects. I checked many, many times before reporting, as I thought I was doing something wrong.