html5lib / html5lib-tests

Testsuite data for html5lib, including the de-facto standard HTML parsing tests.
MIT License
188 stars 58 forks source link

[Suggestion] Add tokens to tree construction tests #100

Open RReverser opened 7 years ago

RReverser commented 7 years ago

Motivation: Some HTML parsers (e.g. parse5 and our internal parser) provide a streaming mode in which tokenizer works as if it's executed together with tree construction algorithm, and so tokenizer states are correctly adjusted on certain tags.

Such adjustment is necessary for tokenizer to correctly tokenize contents of <script>, <textarea> and other special tags while still allowing streaming pass-through without the overhead of tree construction which might also need to buffer content to move nodes around etc.

Problem: For now, tests for such "tokenizer + parser feedback" combination have to be auto-generated at build time by running tree construction tests in a full parser and observing tokens produced (see https://github.com/inikulin/parse5/blob/master/scripts/generate_parser_feedback_test/index.js for an example). This is potentially error-prone and hardly reusable between different parsers (currently you need to copy the produced tests around).

Suggestion: add tokens property to the existing tree construction tests that would contain adjusted tokens as they should be seen after changing states as per HTML5 spec.


I'm happy to prepare PR myself if this proposal is accepted as it would greatly benefit existing and future streaming HTML parsers, and also, I think, can be used by regular parsers for extra checks.

inikulin commented 7 years ago

Browsers can benefit from this too, AFAIK most browsers have preparsers that are used for content pre-load. With this addition there will be a convenient way to test it.

@RReverser The only concern I have is that it will be way to hard to add new tests written by hand. Can we have some tool that generates token list for given markup?

RReverser commented 7 years ago

You mean for literally new tests? I guess writing them by hand shouldn't be much worse than writing a tree itself or any other tokenizer tests.

gsnedders commented 7 years ago

Really if we do this, we could probably eliminate the tokenizer tests (and really, they're not that great: there's no way to run them against some implementations, like browsers, easily, and hence AFAIK no browser vendor runs them). At the same time, we can't just programmatically generate trees using some implementation because for some the interesting part of the test is something the tree constructor will act like it was there anyway (e.g., <p>foo</p> v. <p>foo)… that said, if we just replaced all known tags with <fake-tag> we should still test everything, I think?

RReverser commented 7 years ago

At the same time, we can't just programmatically generate trees

Well we're not talking about trees, but about pure lists of tokens (unless you mean something else), which should still contain only actual tokens from the source, in the same order, with the only difference being that the tokenizer states were adjusted.

For example, this is how one test generated by mentioned parse5 script currently looks like:

{
    "description": "<script><div></script></div><title><p></title><p><p>",
    "input": "<script><div></script></div><title><p></title><p><p>",
    "output": [
        [
            "StartTag",
            "script",
            {}
        ],
        [
            "Character",
            "<div>"
        ],
        [
            "EndTag",
            "script"
        ],
        [
            "EndTag",
            "div"
        ],
        [
            "StartTag",
            "title",
            {}
        ],
        [
            "Character",
            "<p>"
        ],
        [
            "EndTag",
            "title"
        ],
        [
            "StartTag",
            "p",
            {}
        ],
        [
            "StartTag",
            "p",
            {}
        ]
    ]
},
RReverser commented 7 years ago

Really if we do this, we could probably eliminate the tokenizer tests (and really, they're not that great: there's no way to run them against some implementations, like browsers, easily, and hence AFAIK no browser vendor runs them).

I guess... but for the start, would you accept a PR that adds tokens to tree-construction tests, and then we can discuss tokenizer tests separately?

gsnedders commented 7 years ago

At the same time, we can't just programmatically generate trees

Well we're not talking about trees, but about pure lists of tokens (unless you mean something else), which should still contain only actual tokens from the source, in the same order, with the only difference being that the tokenizer states were adjusted.

I meant for the tokenizer tests.

I guess... but for the start, would you accept a PR that adds tokens to tree-construction tests, and then we can discuss tokenizer tests separately?

Yes.