validator / htmlparser

The Validator.nu HTML parser https://about.validator.nu/htmlparser/

Conform tokenizer-only U+0000 NUL handling to spec #40

Closed sideshowbarker closed 3 years ago

sideshowbarker commented 4 years ago

This change brings the tokenizer’s handling of U+0000 NUL characters in the DATA state and the CDATA section state into conformance with the requirements in the HTML spec — for the case where only tokenization is being performed, without tree construction; that is, the case where the tokenizer() method is called, rather than parse() or parseFragment().

Specifically, the tokenization steps defined in the spec require that when a U+0000 NUL is consumed in the DATA state or in the CDATA section state, the parser must then emit a U+0000 NUL. But when performing tree construction, the spec requires that when a U+0000 NUL is consumed, the parser must instead emit a U+FFFD REPLACEMENT CHARACTER.
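As a rough illustration of the distinction described above, here is a minimal Java sketch. It is not the actual htmlparser code: the class `NulHandlingSketch`, the method `emitCharacters`, and the `treeConstruction` flag are hypothetical names chosen for this example, and it models only the simplified behavior stated in this description (NUL passed through when tokenizing only, replaced with U+FFFD when tree construction is being performed).

```java
// Hypothetical sketch; none of these names come from the htmlparser codebase.
final class NulHandlingSketch {
    static final char NUL = '\u0000';
    static final char REPLACEMENT = '\uFFFD';

    /**
     * Returns the characters that would be emitted for the given input,
     * following the simplified rule in this PR description:
     * tokenizer-only mode emits U+0000 as-is, while tree-construction
     * mode emits U+FFFD REPLACEMENT CHARACTER in its place.
     */
    public static String emitCharacters(String input, boolean treeConstruction) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == NUL) {
                out.append(treeConstruction ? REPLACEMENT : NUL);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "a" + NUL + "b";
        // Tokenizer-only: the NUL is preserved in the output.
        System.out.println(emitCharacters(input, false).indexOf(NUL) == 1);
        // Tree construction: the NUL is replaced with U+FFFD.
        System.out.println(emitCharacters(input, true).indexOf(REPLACEMENT) == 1);
    }
}
```

The single boolean flag is only a stand-in for however the real tokenizer knows whether a tree builder is attached.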

Without this change, the parser always emits a U+FFFD REPLACEMENT CHARACTER, even when only tokenization is being performed. That causes us to fail a number of tests in the html5lib-tests suite.

For more background on the relevant behavior, see the following:

Relates to https://github.com/validator/htmlparser/issues/35

hsivonen commented 3 years ago

Landed as 9d72e928e1e683f7eac9946678ed1b4a3d94175a before realizing there was a PR.

Thanks, and sorry about the resulting bad metadata on this PR.