html5lib / html5lib-tests

Testsuite data for html5lib, including the de-facto standard HTML parsing tests.
MIT License
188 stars 61 forks source link

Future of this repo #127

Open gsnedders opened 4 years ago

gsnedders commented 4 years ago

@whatwg/html-parser (i.e., @wycats @gsnedders @sideshowbarker @zcorpan @inikulin @hsivonen) + other parser maintainers I'm aware of (@lexborisov, @jdm who seems to review html5ever stuff now, @iabudiab) this repo has been pretty unloved for a while (see also: all the open PRs), so we should probably decide what we're doing with it.

As far as I'm aware, the encoding tests are only used by html5lib, the serializer tests are entirely unused (and in the browser case essentially supplanted by html/syntax/serializing-html-fragments/ in WPT), so outright dropping both of these costs us nothing.

The tree-construction tests could quite easily move to WPT, which would also make it less likely that we continue to grow WPT-only parsing tests.

The tokenizer tests are the hardest ones; as far as I'm aware there's various consumers, but these are tests that cannot be run within the browser setting (as the tokenizer isn't even exposed). With my WPT hat on, I don't really object to them living there, but it's also not obvious they should live there.

sideshowbarker commented 4 years ago

The tokenizer tests are the hardest ones; as far as I'm aware there's various consumers, but these are tests that cannot be run within the browser setting (as the tokenizer isn't even exposed). With my WPT hat on, I don't really object to them living there, but it's also not obvious they should live there.

I think we should move them, because there’s compelling value to moving them (together).

Rationale:

Given that we already have other test suites in WPT that cannot be run with the browser setting, and no WPT policy that prohibits adding more such test suites, at least it’s clear that there would be nothing blocking us from choosing to move the tokenizer tests into WPT along with the tree-construction tests.

Among the good reasons for doing that would be, if we’re going to move the tests at all, we should move the tokenizers tests together with the tree-construction tests — not split them up.

I guess it’s implicit that nobody would want to see the tests split up. So I guess the question comes down to whether we should move them into WPT or continue to keep them here.

I think it’s clear that one big advantage of moving them is that they’re more easily findable by the set of people who already knows about WPT or who are likely to discover WPT — which is a much bigger set than the people who already know about this repo or who are likely to find it.

And getting the tree-construction tests into WPT seems like an obvious big win that strongly argues for moving them out from here and into WPT. And if we want to keep the tests together, then that necessarily means moving the tokenizer tests along with the tree-construction tests.

gsnedders commented 4 years ago

FWIW, I agree with @sideshowbarker here; I was trying to avoid leading anyone to any specific decision in the OP.

On the whole my inclination is to migrate this all over in a few weeks unless anyone objects.

jgraham commented 4 years ago

I don't object.

nolanw commented 4 years ago

I use the encoding tests in my lil toy parser. That's no reason to keep them, just wanted to mention that the work wasn't unused :)

I don't know what WPT is (I'm pretty out of the loop), so a handy link to wherever the tests end up would be awesome.

sideshowbarker commented 4 years ago

Hi @nolanw, WPT is https://github.com/web-platform-tests/wpt/

nolanw commented 4 years ago

@sideshowbarker Thank you!

Ms2ger commented 4 years ago

Moving to wpt SGTM.

craigbarnes commented 4 years ago

If these tests are moved to WPT, does that imply changing the format at all?

gsnedders commented 4 years ago

@craigbarnes no, it doesn't

hsivonen commented 4 years ago

Since the non-browser test harnesses for the Validator.nu HTML Parser use the present input formats and are more sensitive to format changes that, as I understand it, the html5lib harness, I'd prefer to avoid format changes and I'd like to keep the non-scripted tree construction tests clearly separate from the scripted ones.

I'm OK with moving these to the WPT repo if it's up to the WPT harness to consume the tests in their current format for in-browser testing.

zcorpan commented 4 years ago

I'm ok with moving to wpt.

untitaker commented 2 years ago

these are tests that cannot be run within the browser setting (as the tokenizer isn't even exposed)

not to derail the thread too much, but couldn't you "run" them in the browser by:

1) constructing tree/dom from the raw input of the test using browser 2) constructing tree/dom from the expected output tokens of the test using some unrelated, non-browser "reference" tree builder implementation 3) compare the resulting doms

gsnedders commented 1 year ago

these are tests that cannot be run within the browser setting (as the tokenizer isn't even exposed)

not to derail the thread too much, but couldn't you "run" them in the browser by:

  1. constructing tree/dom from the raw input of the test using browser
  2. constructing tree/dom from the expected output tokens of the test using some unrelated, non-browser "reference" tree builder implementation
  3. compare the resulting doms

Assuming we have a correct reference, yes.

But this often doesn't actually handle a whole load of the tests (e.g., if the last token didn't actually change the tree in any way—like if it's most closing tags).

gsnedders commented 1 year ago

https://github.com/html5lib/html5lib-tests/issues/141 is perhaps the most interesting potential complexity here—various people have added some CI recently, partly so better pay attention to what line/character offsets parse errors occur at.

cc @fb55, @flavorjones given you're the other people recently active who maintain non-browser implementations

fb55 commented 1 year ago

I opened several PRs in this repo last year that ended up breaking some parser implementation due to aspects that parse5 doesn't use (usually error locations), prompting a fix after the PR was merged. That experience is quite negative, and having CI set up for relevant parsers would be a big win.

As this issue is about porting this repo to WPT: WPT can definitely run the CI jobs that were proposed, but given how small of a part parser tests would be, I can see how this can become an issue.

I do enjoy the simplicity of using the parser tests as a submodule, and how I can use dependabot to open a PR when something was merged. The availability of more reviewers might outweigh all downsides though.

zcorpan commented 1 year ago

Which parsers are the relevant parsers that we should cover in CI? (Edit: oh, this is #141 )

I think a benefit of having the test data in wpt is that it's easier for browser engineers to fix and add tests, given existing syncing infra with wpt. For the dependabot usecase, would it work to automatically import test data to this repo from wpt when the test data changes in wpt?

fb55 commented 1 year ago

For the dependabot usecase, would it work to automatically import test data to this repo from wpt when the test data changes in wpt?

Dependabot can update submodules automatically. Submodules are always for an entire repository, so a move to WPT would add a lot of noise and I'd have to find another solution. I like the convenience of the current setup, but care much more about html5lib-tests being maintained.

zcorpan commented 1 year ago

Yeah so if we move the test data to wpt but also set up automation in this repo to copy the test data back to this repo whenever the test data in wpt changes, it seems to me that it would work the same for downstream projects in terms of updating tests. It would only change where to contribute test changes (wpt instead of this repo).

fb55 commented 1 year ago

That makes a lot of sense to me. As a contributor, I'd still like to see downstream tests, but from a consumer's perspective I don't think moving the WPT would be much of an issue.