Fix tokenizer EOF error positions

html5lib / html5lib-tests

Testsuite data for html5lib, including the de-facto standard HTML parsing tests.

MIT License

Fix tokenizer EOF error positions #144

Closed by fb55 2 years ago

fb55 commented 2 years ago

I am trying to move parse5 to the upstream html5lib-tests repo (away from this fork). As the first PR to come from this effort, this one corrects some expected tokenizer errors. The changes fall into three categories:

1. Off-by-one errors for EOF errors. Most EOF errors already point at the column after the last character, with some exceptions. These exceptions were fixed.
2. Line breaks being ignored by some EOF errors. Similar to (1), these are the exception.
3. ~~unknown-named-character-reference errors were missing entirely and have been added.~~ Reverted.
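
To illustrate categories (1) and (2), here is a minimal sketch of how an EOF error position could be computed so that it points at the column after the last character and takes line breaks into account. The function name `eof_error_position` is hypothetical, and the 1-based `line`/`col` convention and the newline normalization are assumptions for illustration, not something this PR or the test suite defines.

```python
def eof_error_position(source: str) -> tuple[int, int]:
    """Return an assumed (line, col) for an error reported at end of input.

    Both values are 1-based, and col points one past the last character
    on the final line (an assumption made for this sketch).
    """
    # Treat CRLF and lone CR as a single LF, so each line break counts once.
    normalized = source.replace("\r\n", "\n").replace("\r", "\n")

    # Every line break moves the position one line down.
    line = normalized.count("\n") + 1

    # rfind returns -1 when there is no line break, which gives
    # len(normalized) + 1, i.e. the column after the last character.
    last_break = normalized.rfind("\n")
    col = len(normalized) - last_break

    return line, col


if __name__ == "__main__":
    print(eof_error_position("<!--comment"))
    print(eof_error_position("<!--\nab"))
```

Under these assumptions, an off-by-one bug would show up as a `col` equal to the index of the last character, and a line-break bug as a `line` that stays at 1 for multi-line input.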

untitaker commented 2 years ago

In fact, if you check the spec, &noti.. is the exact example they use to describe that edge case (https://html.spec.whatwg.org/multipage/parsing.html#named-character-reference-state):

if the markup contains the string `I'm &notit; I tell you` in an attribute, no character reference is parsed and string remains intact (and there is no parse error).
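
The rule quoted above can be sketched roughly as follows. This is a simplified illustration, not the spec's state machine: `ENTITIES` is a tiny hypothetical subset of the named character reference table, `resolve_named_reference` is a made-up helper, and everything after the reference is treated as plain text.

```python
# Tiny illustrative subset of the named character reference table.
ENTITIES = {"not": "\u00ac", "not;": "\u00ac", "notin;": "\u2209"}


def resolve_named_reference(text: str, in_attribute: bool) -> tuple[str, list[str]]:
    """Resolve a reference whose name starts at text[0] (just after '&').

    Returns the resulting string and a list of parse error codes.
    """
    # Find the longest prefix of the input that is a known entity name.
    match = None
    for end in range(len(text), 0, -1):
        if text[:end] in ENTITIES:
            match = text[:end]
            break

    if match is None:
        # No match: the text stays intact. An error is reported only if
        # the run of ASCII alphanumerics is terminated by ';'.
        i = 0
        while i < len(text) and text[i].isascii() and text[i].isalnum():
            i += 1
        errors = ["unknown-named-character-reference"] if text[i:i + 1] == ";" else []
        return "&" + text, errors

    rest = text[len(match):]
    if not match.endswith(";"):
        nxt = rest[:1]
        # Historical exception: inside an attribute, a match without ';'
        # followed by '=' or an ASCII alphanumeric is left untouched.
        if in_attribute and (nxt == "=" or (nxt.isascii() and nxt.isalnum())):
            return "&" + text, []
        return ENTITIES[match] + rest, ["missing-semicolon-after-character-reference"]

    return ENTITIES[match] + rest, []


if __name__ == "__main__":
    print(resolve_named_reference("notit; I tell you", in_attribute=True))
    print(resolve_named_reference("notit; I tell you", in_attribute=False))
```

In this sketch the attribute case returns the input unchanged with an empty error list, which is why adding unknown-named-character-reference errors for such inputs would not match the spec.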

fb55 commented 2 years ago

Thanks a lot for flagging this, @untitaker. I've reverted the additions.

untitaker commented 2 years ago

error locations are not actually standardized, right? this is just to make the testsuite internally consistent?

fb55 commented 2 years ago

> error locations are not actually standardized, right? this is just to make the testsuite internally consistent?

That is correct.

fb55 commented 2 years ago

@Ms2ger It would be great if you could have another look at this (as well as #145 if possible)!