Test the encoding sniffing algorithm (aka meta prescan)

This change adds a preparsed subdirectory in the encoding directory, with tests for which the result of the encoding sniffing algorithm at https://html.spec.whatwg.org/#encoding-sniffing-algorithm is the expected result — that is, tests for which the expected result is the output of running only the encoding sniffing algorithm (of which the main sub-algorithm is the so-called “meta prescan”) — without also running the tokenization state machine and tree-construction stage.

This change also adds a README file that explicitly documents what the expected results for the encoding tests are, based on whether or not they’re in the preparsed subdirectory.

Without those changes, it’s unclear whether the expected results shown in the existing tests are for the output of fully parsing the test data — through the tokenization state machine and tree-construction stage — or instead just the output of the encoding sniffing algorithm only. And without those changes, we also don’t have any tests a system can use for testing only the output from the encoding sniffing algorithm.

Fixes https://github.com/html5lib/html5lib-tests/issues/28

There's still a part of me that would rather we just add a #preparsed section to the existing tests as and when it differs from the final state from the parser, as in the vast majority of cases they can be used for both the pre-parser and the whole system.

I thought about that too, but I think that’s less preferable because it would push off more work onto the N downstream parser implementors who’d all need to update their test-data parsers in their test harnesses — to teach the testharness about that new #preparsed section name.

In contrast, putting the tests with new test expectations into a new subdirectory allows those N downstream parser implementors to just keep running their existing test harnesses on the existing test data with zero changes needed. And for handling the added preparsed subdirectory, they could just completely ignore it for now. I can imagine for most of them, that will happen automatically, unless their harness code is doing directory recursion — and even in that case, it’s a minor change to just a condition in their code to ignore the preparsed subdirectory.

(also does scripting support matter for testing encoding? how do document.write and <noscript> interact here? because it so we should make sure we have expectations as required)

I think that question is orthogonal to the scope of this PR. But it seems to me the answer is that for the preparsed case, scripting support definitely does not matter. For the “normal” parsed case, I don’t the answer. But since the implementation I work on doesn’t have scripting support anyway, I’m not personally super-motivated to spend time investigating it…

I thought about that too, but I think that’s less preferable because it would push off more work onto the N downstream parser implementors who’d all need to update their test-data parsers in their test harnesses — to teach the testharness about that new #preparsed section name.

How much work is that for most downstream parsers?

In contrast, putting the tests with new test expectations into a new subdirectory allows those N downstream parser implementors to just keep running their existing test harnesses on the existing test data with zero changes needed. And for handling the added preparsed subdirectory, they could just completely ignore it for now. I can imagine for most of them, that will happen automatically, unless their harness code is doing directory recursion — and even in that case, it’s a minor change to just a condition in their code to ignore the preparsed subdirectory.

I think the majority do do recursion?

I think that question is orthogonal to the scope of this PR.

It is. Just wondering if we might potentially want further flags later, because any way we do this has a cost to some people hence it would be nice to have it happen all at once.

I thought about that too, but I think that’s less preferable because it would push off more work onto the N downstream parser implementors who’d all need to update their test-data parsers in their test harnesses — to teach the testharness about that new #preparsed section name.

How much work is that for most downstream parsers?

For the test harness we’re using the validator.nu parser, I guess it’d be an hour of work, versus minutes of work if we just put the tests into a different directory.

But that said, it’s not absolutely a lot of work.

However, I think what’s more important to consider than how much work it takes is instead, whether it will break anybody’s existing test harnesses. And in that case I can say that adding a new #preparsed section name that our current harness doesn’t know about will break our harness. (I recognize that our harness code should ideally by written to just ignore section names it doesn’t recognize, but that’s not the case now.)

So I personally would prefer not to make changes that have a higher risk of breaking somebody else’s harness code, and instead mitigate that risk by adding the new test types to the existing test suite in a way that seems less likely to break anybody’s existing harness.

I also recognize that it would not be a lot of work for others to update their harness code to recognize a new section name. But I would prefer not to make the choice for them unilaterally.

In contrast, putting the tests with new test expectations into a new subdirectory allows those N downstream parser implementors to just keep running their existing test harnesses on the existing test data with zero changes needed. And for handling the added preparsed subdirectory, they could just completely ignore it for now. I can imagine for most of them, that will happen automatically, unless their harness code is doing directory recursion — and even in that case, it’s a minor change to just a condition in their code to ignore the preparsed subdirectory.

I think the majority do do recursion?

So even if they’re doing recursion, the relatively minor change that anybody could make for now — short of adding full support for the new test types — is to write a single one-to-three-line condition that just causes their harness to completely ignore the preparsed subdirectory for now. And then when they can make time to go back and actually add support for running the new tests, they remove that condition and un-ignore that subdirectory.

For the case of the validator.nu parser now, that’s what we’re already doing for the scripted subdirectories. We don’t have scripting support, so we just permanently ignore any scripted subdirectories.

So I guess I personally would be less happy if the decision had been made previously to add a #scripted section name rather than using the subdirectory mechanism. Because in that case, the test files would — from my perspective — cluttered with additional #scripted tests that I don’t care about and won’t run, but which are mixed in with the rest of the tests that I actually do care about.

@inikulin Given you’re the main person who’s been most active in this repo most recently, it’d be helpful if you weigh in with an opinion on this patch. Specifically, would it better to add a new #preparsed section name, or would you prefer the tests be put into a separate subdirectory (without a new section name being added)?

html5lib / html5lib-tests

Test the encoding sniffing algorithm (aka meta prescan) #130