validator / htmlparser

The Validator.nu HTML parser https://about.validator.nu/htmlparser/
Other
56 stars 26 forks source link

Conform encoding handling to Encoding spec #48

Open sideshowbarker opened 4 years ago

sideshowbarker commented 4 years ago

This makes the parser’s handling of encodings conformant with current requirements in the Encoding spec and HTML spec.

To that end, this PR branch includes the following important changes:

81c40ef Conform MetaSniffer "x-user-defined" handling f9b3796 Sync w/ “Changing the encoding while parsing” algo ceb9fe3 Drop UTF-32 & “UTF-16”; use UTF-16BE and UTF-16LE 9d2d39d Drop Encoding.toAsciiLowerCase; use toLowerCase() beb060b Conform encoding-label matching to Encoding spec 8c62d2d Conform supported encodings to Encoding spec 5cfabbe Drop superseded encoding-checking code 343d86f Drop “actual HTML encoding”-related code 03954fb Drop/replace “is ASCII superset”-checking code

It also includes some related minor changes to get things working smoothly.

The beb060b change is particularly notable in that it fixes the parser’s encoding-name matching to conform to the current Encoding spec at https://encoding.spec.whatwg.org/#concept-encoding-get — which requires that only leading and trailing whitespace be removed from a string before checking if it matches any valid encoding names.

Otherwise, without that change, the parser instead implements https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching — which requires deleting “all characters except a-z, A-Z, and 0-9” from a string before checking if it matches any valid encoding names. That difference makes us fail two html5lib-tests cases.

The html5lib-tests cases that we fail without that beb060b change are the following:

sideshowbarker commented 4 years ago

I’m seeing some possible regressions in the HTML checker as a result of this —

error: The encoding “utf-8” is not an IANA-registered encoding and did not use the “x-” prefix. (Charmod C023)
info warning: The character encoding “utf-8” is not widely supported. Better interoperability may be achieved by using “UTF-8”.
info warning: Using “x-iso-2022-cn-gb” instead of the declared encoding “utf-8”.

That is, this change seems to have caused the parser to fail to parse utf-8 case-insensitively.

And then for some reason, the parser seems to randomly picking some other encoding to use — the encoding name in the last error message above is non-deterministic. When I re-run the tests, different encoding names show up:

info warning: Using “x-machebrew” instead of the declared encoding “utf-8”.

or:

info warning: Using “x-maccroatian” instead of the declared encoding “utf-8”.

… etc.

sideshowbarker commented 4 years ago

d’oh — never mind my previous comment; I see now that I just neglected to negate the condition. Will push an update

sideshowbarker commented 4 years ago

OK, I found that the existing code had effectively been using, as the internal canonical name, names with all dashes and underscores removed. That results in a conformance violation, because it means that, for example, we would treat the string ut-f8 (invalid as an encoding name) the same as utf-8 or utf8.

So with 92275d9, I’ve updated the checking to conform to exactly what the Encoding spec requires — which is to do an exact comparison of the lower-cased input string to the set of labels in https://encoding.spec.whatwg.org/#names-and-labels

sideshowbarker commented 4 years ago

OK, I’ve now pushed a large number of commits to this branch that together make the parser’s handling of encodings conformant with current requirements in the Encoding spec and HTML spec. The key changes are the following:

81c40ef Conform MetaSniffer "x-user-defined" handling f9b3796 Sync w/ “Changing the encoding while parsing” algo ceb9fe3 Drop UTF-32 & “UTF-16”; use UTF-16BE and UTF-16LE 9d2d39d Drop Encoding.toAsciiLowerCase; use toLowerCase() beb060b Conform encoding-label matching to Encoding spec 8c62d2d Conform supported encodings to Encoding spec 5cfabbe Drop superseded encoding-checking code 343d86f Drop “actual HTML encoding”-related code 03954fb Drop/replace “is ASCII superset”-checking code

The branch also includes some related minor changes to get things working smoothly.

I broke the changes out into separate commits in order to facilitate easier review.

So in reviewing this, you probably want to be starting from the https://github.com/validator/htmlparser/pull/48/commits view rather than the https://github.com/validator/htmlparser/pull/48/files view.

sideshowbarker commented 4 years ago

OK, I’m now a bit embarrassed to admit: it hadn’t dawned on me until now that the changes to the Encoding class in this patch are completely superseded by the code in the encodings branch of the repo.

So I’m now wondering whether we should just not spend any more time at all on this patch but instead try to get the code on the encodings branch finished and ready to merge to the main branch and to release.

I’m certainly willing to put time into helping with that.

carlosame commented 4 years ago

get the code on the encodings branch finished and ready to merge

My understanding was that encodings is an archive branch kept for reference, not initially intended for merging back. Perhaps I'm missing something.