html5lib / html5lib-tests

Testsuite data for html5lib, including the de-facto standard HTML parsing tests.
MIT License

Parser tests without error cases #102

Open · wycats opened this issue 7 years ago

wycats commented 7 years ago

We are attempting to implement a parser for a large subset of valid HTML in Glimmer (including SVG and other foreign content), but unfortunately we cannot use these tests.

The reason is that the tests liberally use invalid HTML (missing doctypes, unclosed tags, etc.) for convenience, but since Glimmer is an offline build tool, we report real errors in those cases.

Right now we make a best-effort attempt at conformance (we fix bugs as they're reported), but I just noticed a bug in our table handling that makes me wish we could use an exhaustive suite directly.

Is there any interest in avoiding gratuitous use of invalid HTML in the tests, or is it too late for that?

inikulin commented 7 years ago

> Is there any interest in avoiding gratuitous use of invalid HTML in the tests, or is it too late for that?

Testing of invalid HTML is one of the reasons why this test suite exists in the first place. However, we are currently working on standardising parse errors in both the spec and the tests. The tokenisation part has already landed, and we have just started work on the tree construction stage. So eventually you will be able to get a subset of the test suite for valid HTML by simply filtering out tests that have parse errors specified.
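
For illustration, a consumer-side filter along these lines might look like the sketch below, assuming the documented `.dat` layout (each test begins with a `#data` line and lists its errors under `#errors`); the path is just an example file from the tree-construction directory:

```python
# Minimal sketch: keep only the error-free tests from an html5lib
# tree-construction .dat file. Assumes each test starts with "#data"
# and section headers are lines beginning with "#".

def parse_dat(text):
    """Yield each test as a dict mapping section name -> list of lines."""
    test, section = {}, None
    for line in text.split("\n"):
        if line.startswith("#"):
            if line == "#data" and test:
                yield test  # a new "#data" line starts the next test
                test = {}
            section = line[1:]
            test[section] = []
        elif section is not None:
            test[section].append(line)
    if test:
        yield test

def error_free_tests(text):
    """Keep only tests whose #errors section is empty."""
    return [t for t in parse_dat(text)
            if not any(line.strip() for line in t.get("errors", []))]

with open("tree-construction/tests1.dat") as f:  # example path
    print(len(error_free_tests(f.read())), "tests without parse errors")
```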

wycats commented 7 years ago

> Testing of invalid HTML is one of the reasons why this test suite exists in the first place.

I certainly want parse errors to be specified and in fact it's critical to me that I report the correct parse errors in my offline compiler.

> However, we are currently working on standardising parse errors in both the spec and the tests. The tokenisation part has already landed, and we have just started work on the tree construction stage. So eventually you will be able to get a subset of the test suite for valid HTML by simply filtering out tests that have parse errors specified.

Today, many tests use parse errors gratuitously. For example, it's hard (maybe impossible, but I can't tell) to find a test for the (valid) case of inserting a <tbody> when a <tr> is found directly inside a <table>.

Here's an example test:

```
#data
<table><td>
#errors
(1,7): expected-doctype-but-got-start-tag
(1,11): unexpected-cell-in-table-body
(1,11): expected-closing-tag-but-got-eof
#document
| <html>
|   <head>
|   <body>
|     <table>
|       <tbody>
|         <tr>
|           <td>
```

This test really tests a very simple, valid thing, but because of the way it happens to be written, it produces a lot of errors that I would somehow have to know to ignore. It's not obvious to me how to do that reliably. Maybe there's a good trick?
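
For contrast, here is what a hypothetical error-free variant of the same test might look like. It is not taken from the suite; the doctype and closing tags are supplied, and the expected tree is written out from the spec's implied-`<tbody>` rule (omitting the optional `<html>`, `<head>`, and `<body>` tags is not a parse error):

```
#data
<!DOCTYPE html><table><tr><td></td></tr></table>
#errors
#document
| <!DOCTYPE html>
| <html>
|   <head>
|   <body>
|     <table>
|       <tbody>
|         <tr>
|           <td>
```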

Also, is the idea of the standardization to make tests like this one avoid gratuitously invalid content? That's more or less what I'm asking for.

inikulin commented 7 years ago

OK, I'm a bit confused now. What's the problem with the errors if your parser can recover from them like any spec-compatible parser does? Or am I missing something?

inikulin commented 7 years ago

Or do you just want more test cases added for valid markup? If so, it's up to you; we are happy to accept PRs.

wycats commented 7 years ago

> What's the problem with the errors if your parser can recover from them like any spec-compatible parser does?

According to the spec:

> This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined: user agents must either act as described below when encountering such problems, or must abort processing at the first error that they encounter for which they do not wish to apply the rules described below.

We abort, in conformance with the spec, but the tests don't have a mode to test this conformant scenario.

inikulin commented 7 years ago

Thank you for the clarification. The only thing we can do at the moment is add more tests for valid scenarios. Any contribution is welcome.

wycats commented 7 years ago

@inikulin would you take patches that add the missing doctypes and closing tags to existing tests that gratuitously omit them?

wycats commented 7 years ago

@inikulin another possible refactoring could be to differentiate between real errors and errors that only arise as a result of early termination of the HTML document.

inikulin commented 7 years ago

We would like to keep the existing tests as they are, since they are widely used by implementations that gracefully handle these errors (i.e. browsers). But we are open to accepting new tests even if they are just valid versions of old ones; you can place them in a separate file, e.g. valid.dat and valid.json.

wycats commented 7 years ago

@inikulin would you be open to adding more metadata to existing tests, like indicating whether an error is due to early EOF? Can you think of a way to do that that wouldn't break existing consumers?

inikulin commented 7 years ago

Well, parse errors are such metadata. It's well documented which tokenisation errors occur due to unexpected EOF.

inikulin commented 7 years ago

Not implemented yet for the tree construction stage, though.

wycats commented 7 years ago

@inikulin what I'm saying is that from my perspective there are "errors I am specifically trying to assert occur" and "errors that occur as a side effect of the way we wrote the test".

Errors that occur due to unexpected EOF can be treated specially in my tests, because I have the partially constructed tree that I can compare.

Maybe all I really need (and maybe this is already true) is for every error that occurs due to unexpected EOF to end in something like but-got-eof. It's probably even sufficient for me to enumerate the errors myself, but this would be brittle if things change and more but-got-eof-style errors were added in the future.
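
As a rough illustration of the enumeration approach, a consumer could partition each test's error list by code, assuming the codes keep the naming visible in the current data files (e.g. `expected-closing-tag-but-got-eof`); this is exactly the convention that could drift, as noted above:

```python
# Sketch: split a test's #errors lines into EOF-truncation errors and
# substantive errors. The "eof" substring test is an assumption based on
# the error names currently used in the data files, not a guarantee.
import re

ERROR_LINE = re.compile(r"\(\d+,\d+\): (?P<code>[\w-]+)")

def split_errors(error_lines):
    eof, substantive = [], []
    for line in error_lines:
        m = ERROR_LINE.match(line.strip())
        code = m.group("code") if m else line.strip()
        (eof if "eof" in code else substantive).append(code)
    return eof, substantive

eof, substantive = split_errors([
    "(1,7): expected-doctype-but-got-start-tag",
    "(1,11): unexpected-cell-in-table-body",
    "(1,11): expected-closing-tag-but-got-eof",
])
print(eof)          # ['expected-closing-tag-but-got-eof']
print(substantive)  # ['expected-doctype-but-got-start-tag', 'unexpected-cell-in-table-body']
```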

inikulin commented 7 years ago

> It's probably even sufficient for me to enumerate the errors myself, but this would be brittle if things change and more but-got-eof-style errors were added in the future.

It's very unlikely that new errors will be added in the near future, and if they were, it would be a significant change to the HTML syntax itself, so you would need to update your parser anyway. Meanwhile, you can find the standard parse errors and their descriptions in the spec: https://html.spec.whatwg.org/multipage/parsing.html#parse-errors

inikulin commented 7 years ago

(Apart from the tree construction stage errors, which are not in the spec yet.)

inikulin commented 7 years ago

My vision is that if an implementation requires some subset of the tests based on some assertion, it should create that subset on its own side, and we provide all the required prerequisites for that, such as standard and well-documented error codes. And if some test cases are missing, we are open to accepting new tests. I don't see any reason why ad hoc solutions should get into this test suite.

inikulin commented 7 years ago

Thinking about it a bit more, we could add support for implementations that bail out on parse errors. We could achieve this by inserting a marker into the expected markup that shows the position of the first parse error; during testing it would then be possible to skip comparison of the AST after this marker. However, this might not work in all cases, such as when the error recovery mechanism modifies the part of the AST that was already built before the error point, e.g. in the case of foster parenting. But we could mark those tests somehow as well, so they can be skipped completely. \cc @gsnedders
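
To make the idea concrete, here is a rough sketch of how a bail-out implementation might consume such a marker, using an invented `#error-point` line (the marker name and placement are hypothetical, not part of the suite):

```python
# Sketch: compare a bail-out parser's partial tree against the expected
# tree only up to a hypothetical "#error-point" marker line.

MARKER = "#error-point"

def compare_until_marker(expected_lines, actual_lines):
    if MARKER in expected_lines:
        cut = expected_lines.index(MARKER)
        expected_lines = expected_lines[:cut]
        actual_lines = actual_lines[:cut]
    return expected_lines == actual_lines

expected = [
    "| <html>",
    "|   <head>",
    "|   <body>",
    MARKER,  # first parse error; everything below is recovery output
    "|     <table>",
]
actual = ["| <html>", "|   <head>", "|   <body>"]  # parser aborted here
print(compare_until_marker(expected, actual))  # True
```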

wycats commented 7 years ago

@inikulin if there's a strategy that you're happy with, I'm willing to do some work to update the tests in that way.

One issue with the current tests is that in a lot of cases the "first parse error" is "missing doctype".

I've been thinking more about the issue of keeping the existing suite working for existing implementations. One approach might be to accept some improvements to the metadata description, but write a script that emits the original format. We could then confirm zero diffs in the original format, while I could still enrich the test definitions with more information.
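
As a sketch of that round-trip idea, a generator could re-emit the legacy `#errors` lines from an enriched source of truth and be checked for zero diffs; the JSON schema here (including the `eof` flag) is invented for illustration:

```python
# Sketch: regenerate the legacy "#errors" lines from a hypothetical
# enriched JSON description, and verify the output matches byte for byte.
import json

def emit_errors(test):
    return [f"({e['line']},{e['col']}): {e['code']}" for e in test["errors"]]

enriched = json.loads("""
{"errors": [
  {"line": 1, "col": 7,  "code": "expected-doctype-but-got-start-tag", "eof": false},
  {"line": 1, "col": 11, "code": "expected-closing-tag-but-got-eof",   "eof": true}
]}
""")

legacy = [
    "(1,7): expected-doctype-but-got-start-tag",
    "(1,11): expected-closing-tag-but-got-eof",
]
assert emit_errors(enriched) == legacy  # zero diffs against the original format
```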

inikulin commented 7 years ago

@wycats Sorry, I somehow missed your last reply. In my opinion, for now the best solution is one that doesn't require any change to the existing test format. So I vote for adding new tests in a separate file (they can be modified versions of existing tests).