Closed sideshowbarker closed 3 years ago
We’re failing 30 test cases in https://github.com/html5lib/html5lib-tests/tree/master/tokenizer/ (see below). At least 10 of them are related to handling of U+0000 NUL characters.
I don’t understand why the Java parser is failing these but the Firefox parser isn’t.
--------------------------------
Failure
Raw NUL replacement
Input: \u0000
Expected tokens: [["Character","\\uFFFD"]]
Actual tokens: [["Character","\\u0000"]]
--------------------------------
Failure
Raw NUL replacement
Input: \u0000
Expected tokens: [["Character","\\uFFFD"]]
Actual tokens: [["Character","\\u0000"]]
--------------------------------
Failure
Raw NUL replacement
Input: \u0000
Expected tokens: [["Character","\\uFFFD"]]
Actual tokens: [["Character","\\u0000"]]
--------------------------------
Failure
Raw NUL replacement
Input: \u0000
Expected tokens: [["Character","\\uFFFD"]]
Actual tokens: [["Character","\\u0000"]]
--------------------------------
Failure
NUL in CDATA section
Input: \u0000]]>
Expected tokens: [["Character","\\u0000"]]
Actual tokens: [["Character","\\u0000]]>"]]
--------------------------------
Failure
NUL in script HTML comment
Input: <!--test\u0000--><!--test-\u0000--><!--test--\u0000-->
Expected tokens: [["Character","<!--test\\uFFFD--><!--test-\\uFFFD--><!--test--\\uFFFD-->"]]
Actual tokens: [["Character","<!--test\\u0000--><!--test-\\u0000--><!--test--\\u0000-->"]]
--------------------------------
Failure
NUL in script HTML comment - double escaped
Input: <!--<script>\u0000--><!--<script>-\u0000--><!--<script>--\u0000-->
Expected tokens: [["Character","<!--<script>\\uFFFD--><!--<script>-\\uFFFD--><!--<script>--\\uFFFD-->"]]
Actual tokens: [["Character","<!--<script>\\u0000--><!--<script>-\\u0000--><!--<script>--\\u0000-->"]]
--------------------------------
Failure
lowercase endtags
Input: </XMP>
Expected tokens: [["EndTag","xmp"]]
Actual tokens: [["Character","</XMP>"]]
--------------------------------
Failure
--!NUL in comment
Input: <!----!\u0000-->
Expected tokens: [["Comment","--!\\uFFFD"]]
Actual tokens: [["Comment","--!\\u0000"]]
--------------------------------
Failure
CDATA content
Input: foo ]]>
Expected tokens: [["Character","foo "]]
Actual tokens: [["Character","foo ]]>"]]
--------------------------------
Failure
CDATA followed by HTML content
Input: foo ]]> 
Expected tokens: [["Character","foo  "]]
Actual tokens: [["Character","foo ]]> "]]
--------------------------------
Failure
CDATA with extra bracket
Input: foo]]]>
Expected tokens: [["Character","foo]"]]
Actual tokens: [["Character","foo]]]>"]]
--------------------------------
Failure
DOCTYPE without name
Input: <!DOCTYPE>
Expected tokens: [["DOCTYPE",null,null,null,false]]
Actual tokens: [["DOCTYPE","",null,null,false]]
--------------------------------
Failure
Null Byte Replacement
Input:
Expected tokens: [["Character"," "]]
Actual tokens: [["Character","�"]]
--------------------------------
Failure
<\u0000
Input: <
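For reference, the NUL expectations in the failures above differ by tokenizer state. The following is only a minimal sketch of that rule as I read the HTML Standard's tokenizer, not code from the Java parser or from the test harness. The function name `emit_text` and the state labels are hypothetical, and it only covers plain text with no markup in it.

```python
REPLACEMENT = "\uFFFD"

# Assumption: these are the text-like tokenizer states in which the
# HTML Standard emits U+FFFD in place of U+0000.
REPLACING_STATES = {"RCDATA", "RAWTEXT", "script data", "PLAINTEXT"}


def emit_text(state: str, text: str) -> str:
    """Return the character data expected for markup-free `text` in `state`.

    Hypothetical helper for illustration only; a real tokenizer handles
    NUL character by character inside its state machine.
    """
    if state in REPLACING_STATES:
        return text.replace("\u0000", REPLACEMENT)
    # The data and CDATA section states pass U+0000 through unchanged
    # at the tokenizer level.
    return text
```

Under these assumptions, `emit_text("RCDATA", "\u0000")` gives U+FFFD, matching the "Raw NUL replacement" expectation, while `emit_text("CDATA section", "\u0000")` leaves the NUL in place, matching "NUL in CDATA section".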