validator / htmlparser

The Validator.nu HTML parser https://about.validator.nu/htmlparser/

Investigate 30 failing tokenizer tests #35

Closed sideshowbarker closed 3 years ago

sideshowbarker commented 4 years ago

We’re failing 30 test cases in https://github.com/html5lib/html5lib-tests/tree/master/tokenizer/ (see below). At least 10 of them are related to handling of U+0000 NUL characters.

I don’t understand why the Java parser is failing these but the Firefox parser isn’t.
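For context on the NUL-related failures: the HTML Standard’s treatment of U+0000 depends on the tokenizer state. In the data state the NUL is emitted unchanged (with only a parse error), while in the RCDATA, RAWTEXT, and script data states it is replaced by U+FFFD. A minimal sketch of that state-dependent behavior (the class and method names here are hypothetical, not the Validator.nu code):

```java
// Sketch of the HTML Standard's state-dependent U+0000 handling.
// Not the actual parser code; names are illustrative only.
public final class NulHandling {
    public enum State { DATA, RCDATA, RAWTEXT, SCRIPT_DATA }

    /** The character a spec-conforming tokenizer emits when it reads U+0000. */
    public static char emitForNul(State state) {
        switch (state) {
            case DATA:
                return '\u0000'; // data state: emit the NUL unchanged (parse error only)
            default:
                return '\uFFFD'; // RCDATA/RAWTEXT/script data: REPLACEMENT CHARACTER
        }
    }
}
```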

--------------------------------
Failure
Raw NUL replacement
Input:
\u0000
Expected tokens:
[["Character","\\uFFFD"]]
Actual tokens:
[["Character","\\u0000"]]
--------------------------------
Failure
Raw NUL replacement
Input:
\u0000
Expected tokens:
[["Character","\\uFFFD"]]
Actual tokens:
[["Character","\\u0000"]]
--------------------------------
Failure
Raw NUL replacement
Input:
\u0000
Expected tokens:
[["Character","\\uFFFD"]]
Actual tokens:
[["Character","\\u0000"]]
--------------------------------
Failure
Raw NUL replacement
Input:
\u0000
Expected tokens:
[["Character","\\uFFFD"]]
Actual tokens:
[["Character","\\u0000"]]
--------------------------------
Failure
NUL in CDATA section
Input:
\u0000]]>
Expected tokens:
[["Character","\\u0000"]]
Actual tokens:
[["Character","\\u0000]]>"]]
--------------------------------
Failure
NUL in script HTML comment
Input:
<!--test\u0000--><!--test-\u0000--><!--test--\u0000-->
Expected tokens:
[["Character","<!--test\\uFFFD--><!--test-\\uFFFD--><!--test--\\uFFFD-->"]]
Actual tokens:
[["Character","<!--test\\u0000--><!--test-\\u0000--><!--test--\\u0000-->"]]
--------------------------------
Failure
NUL in script HTML comment - double escaped
Input:
<!--<script>\u0000--><!--<script>-\u0000--><!--<script>--\u0000-->
Expected tokens:
[["Character","<!--<script>\\uFFFD--><!--<script>-\\uFFFD--><!--<script>--\\uFFFD-->"]]
Actual tokens:
[["Character","<!--<script>\\u0000--><!--<script>-\\u0000--><!--<script>--\\u0000-->"]]
--------------------------------
Failure
lowercase endtags
Input:
</XMP>
Expected tokens:
[["EndTag","xmp"]]
Actual tokens:
[["Character","</XMP>"]]
--------------------------------
Failure
--!NUL in comment 
Input:
<!----!\u0000-->
Expected tokens:
[["Comment","--!\\uFFFD"]]
Actual tokens:
[["Comment","--!\\u0000"]]
--------------------------------
Failure
CDATA content
Input:
foo&#32;]]>
Expected tokens:
[["Character","foo&#32;"]]
Actual tokens:
[["Character","foo ]]>"]]
--------------------------------
Failure
CDATA followed by HTML content
Input:
foo&#32;]]>&#32;
Expected tokens:
[["Character","foo&#32; "]]
Actual tokens:
[["Character","foo ]]> "]]
--------------------------------
Failure
CDATA with extra bracket
Input:
foo]]]>
Expected tokens:
[["Character","foo]"]]
Actual tokens:
[["Character","foo]]]>"]]
--------------------------------
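The three CDATA failures above come down to recognizing the `]]>` end sequence: `]` characters must be buffered and only flushed as character data once it is clear they don’t begin `]]>`, and character references must not be processed inside the section (hence the expected `foo&#32;` rather than `foo `). A sketch of that bracket buffering, under the assumption of a standalone helper (hypothetical names, not the parser’s actual code):

```java
// Sketch of CDATA-section tokenization: buffer pending ']' characters
// until we know whether they start the "]]>" end sequence.
// Hypothetical helper, not the Validator.nu implementation.
public final class CdataEnd {
    /** Returns the character data emitted for a CDATA section body. */
    public static String cdataText(String input) {
        StringBuilder out = new StringBuilder();
        int brackets = 0;                     // pending ']' characters (0..2)
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == ']') {
                if (brackets == 2) {
                    out.append(']');          // three ']' in a row: flush one
                } else {
                    brackets++;
                }
            } else if (c == '>' && brackets == 2) {
                return out.toString();        // "]]>" ends the section
            } else {
                for (int j = 0; j < brackets; j++) out.append(']');
                brackets = 0;
                out.append(c);                // no character-reference processing
            }
        }
        for (int j = 0; j < brackets; j++) out.append(']');
        return out.toString();                // EOF also ends the section
    }
}
```

With input `foo]]]>` this yields `foo]`: the first `]` is flushed when a third `]` arrives, and the remaining `]]>` terminates the section.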
Failure
DOCTYPE without name
Input:
<!DOCTYPE>
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
Null Byte Replacement
Input:
\u0000
Expected tokens:
[["Character","\\u0000"]]
Actual tokens:
[["Character","\\uFFFD"]]
--------------------------------
Failure
<\u0000
Input:
<\u0000
Expected tokens:
[["Character","<\\u0000"]]
Actual tokens:
[["Character","<\\uFFFD"]]
--------------------------------
Failure
<!DOCTYPE
Input:
<!DOCTYPE
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE\u0009
Input:
<!DOCTYPE   
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE\u000A
Input:
<!DOCTYPE

Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE\u000C
Input:
<!DOCTYPE
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE\u000D
Input:
<!DOCTYPE
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE 
Input:
<!DOCTYPE 
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE \u0009
Input:
<!DOCTYPE   
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE \u000A
Input:
<!DOCTYPE 

Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE \u000C
Input:
<!DOCTYPE 
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE \u000D
Input:
<!DOCTYPE 
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE  
Input:
<!DOCTYPE  
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE >
Input:
<!DOCTYPE >
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
Failure
<!DOCTYPE>
Input:
<!DOCTYPE>
Expected tokens:
[["DOCTYPE",null,null,null,false]]
Actual tokens:
[["DOCTYPE","",null,null,false]]
--------------------------------
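All of the DOCTYPE failures above are one bug: the html5lib token format distinguishes a DOCTYPE with no name at all (null) from one with an empty name (""), and the tokenizer reports "" where null is expected. Creating the name buffer lazily, only when a name character is actually seen, produces the distinction the tests want. A simplified sketch covering just the name (hypothetical helper, not the parser’s code):

```java
// Sketch: lazily created name buffer, so "<!DOCTYPE>" and "<!DOCTYPE "
// yield a null name rather than "". Hypothetical helper for illustration.
public final class DoctypeName {
    /** Returns the DOCTYPE name from the text after "<!DOCTYPE", or null if none. */
    public static String doctypeName(String afterKeyword) {
        StringBuilder name = null;            // null means "no name character seen yet"
        for (int i = 0; i < afterKeyword.length(); i++) {
            char c = afterKeyword.charAt(i);
            if (c == '>') break;
            if (c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r') {
                if (name != null) break;      // whitespace after the name ends it
                continue;                     // leading whitespace before the name
            }
            if (name == null) name = new StringBuilder();
            name.append(Character.toLowerCase(c));
        }
        return name == null ? null : name.toString();
    }
}
```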
Failure
U+0000 in lookahead region after non-matching character
Input:
<!doc>\u0000
Expected tokens:
[["Comment","doc"],["Character","\\u0000"]]
Actual tokens:
[["Comment","doc"],["Character","\\uFFFD"]]
--------------------------------
Failure
CR followed by U+0000
Input:
\u000D\u0000
Expected tokens:
[["Character","\n\\u0000"]]
Actual tokens:
[["Character","\n\\uFFFD"]]
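The last entry reflects the input stream preprocessing step: before tokenization, the HTML Standard normalizes CRLF pairs and lone CRs to LF, so a CR followed by NUL becomes LF followed by NUL (and the NUL then gets the data-state treatment). A sketch of that normalization (hypothetical class name):

```java
// Sketch of HTML input preprocessing: CRLF and lone CR both become LF.
// Hypothetical helper for illustration.
public final class Newlines {
    public static String normalize(String in) {
        return in.replace("\r\n", "\n").replace('\r', '\n');
    }
}
```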
sideshowbarker commented 4 years ago

I’ve raised a series of pull requests that fix all but one of the failures in the issue description.

Those pull requests are #30, #31, #32, #36, #37, #38, #39, and #40.

With those changes applied, the one remaining tokenization test from the html5lib-tests suite we’re still failing is this one:

https://github.com/html5lib/html5lib-tests/blob/master/tokenizer/domjs.test#L219-L223

"description":"lowercase endtags",
"initialStates":["RCDATA state", "RAWTEXT state", "Script data state"],
"lastStartTag":"xmp",
"input":"</XMP>",
"output":[["EndTag","xmp"]]

And the following is the testharness failure message:

Failure
lowercase endtags
Input:
</XMP>
Expected tokens:
[["EndTag","xmp"]]
Actual tokens:
[["Character","</XMP>"]]

I’ve not had time to investigate that one yet, but one thing about it already seems odd: given that its initialStates value lists three states, I’d expect the harness to run it three times, yet the harness apparently evaluates it only once. Also, the untokenized </XMP> showing up as character data in the Actual tokens output seems to indicate that tokenization isn’t being run on the input at all.
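For reference, the condition this test exercises is the spec’s "appropriate end tag" check: in the RCDATA, RAWTEXT, and script data states, an end tag only leaves the section if its name matches the last emitted start tag, compared ASCII case-insensitively, which is why </XMP> with a lastStartTag of "xmp" should produce an EndTag token. A minimal sketch of that comparison (hypothetical names; equalsIgnoreCase is a stand-in for the spec’s ASCII-only folding):

```java
// Sketch of the "appropriate end tag" check used in RCDATA/RAWTEXT/script data.
// Hypothetical helper; equalsIgnoreCase approximates ASCII case-insensitive matching.
public final class AppropriateEndTag {
    public static boolean matches(String lastStartTag, String endTagName) {
        return lastStartTag != null && lastStartTag.equalsIgnoreCase(endTagName);
    }
}
```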

sideshowbarker commented 4 years ago

With those changes applied, the one remaining tokenization test from the html5lib-tests suite we’re still failing is this one:

https://github.com/html5lib/html5lib-tests/blob/master/tokenizer/domjs.test#L219-L223

OK, I figured out the cause of that one, and #41 has a fix.

With that fix applied along with the others mentioned in https://github.com/validator/htmlparser/issues/35#issuecomment-671748072, we pass all tokenizer tests in the html5lib-tests suite.