This change brings the tokenizer’s handling of U+0000 NUL characters in the DATA state and the CDATA section state into conformance with the requirements in the HTML spec for the case where only tokenization is being performed, without tree construction. That is the case where the tokenizer() method is called, rather than parse() or parseFragment().
Specifically, the tokenization steps defined in the spec require that when a U+0000 NUL is consumed in the DATA state or in the CDATA section state, the parser must then emit a U+0000 NUL. But when performing tree construction, the spec requires that when a U+0000 NUL is consumed, the parser must instead emit a U+FFFD REPLACEMENT CHARACTER.
Without this change, the parser always emits a U+FFFD REPLACEMENT CHARACTER, even when only tokenization is being performed. That causes us to fail a number of tests in the html5lib-tests suite.
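As a rough illustration of the distinction described above, here is a minimal Python sketch. It is not the parser's actual code: `emit_chars` and its `tree_construction` flag are hypothetical names invented for this example, and the sketch models only how a consumed U+0000 NUL is emitted in the two modes as summarized in this description.

```python
NUL = "\u0000"
REPLACEMENT = "\ufffd"


def emit_chars(text, tree_construction):
    """Hypothetical helper: return the characters emitted for `text`.

    Assumption: this models only the NUL handling summarized in this
    description for the DATA and CDATA section states, nothing else.
    """
    out = []
    for ch in text:
        if ch == NUL and tree_construction:
            # With tree construction, a consumed NUL becomes U+FFFD.
            out.append(REPLACEMENT)
        else:
            # Tokenization only: the NUL is emitted unchanged.
            out.append(ch)
    return "".join(out)


tokens_only = emit_chars("a\u0000b", tree_construction=False)
with_tree = emit_chars("a\u0000b", tree_construction=True)
```

In this sketch the tokenization-only call keeps the NUL, while the tree-construction call substitutes U+FFFD. The behavior before this change corresponds to always taking the second path.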
For more background on the relevant behavior, see the following:
Relates to https://github.com/validator/htmlparser/issues/35