Changes to the data format

html5lib / html5lib-tests

Testsuite data for html5lib, including the de-facto standard HTML parsing tests.

MIT License


Changes to the data format #160

Open annevk opened 1 year ago

annevk commented 1 year ago

I want to make two changes to the ways attributes are serialized to ensure better test coverage:

1. They are no longer sorted. We enforce insertion order as the specification does.
2. We serialize their qualified name, including prefix, if any.

As an example, https://github.com/html5lib/html5lib-tests/blob/master/tree-construction/tests10.dat#L388-L401 looks like

#data
<!DOCTYPE html><body xlink:href=foo xml:lang=en><svg><g xml:lang=en xlink:href=foo></g></svg>
#errors
#document
| <!DOCTYPE html>
| <html>
|   <head>
|   <body>
|     xlink:href="foo"
|     xml:lang="en"
|     <svg svg>
|       <svg g>
|         xlink href="foo"
|         xml lang="en"

today and the last part would change to

|       <svg g>
|         xml xml:lang="en"
|         xlink xlink:href="foo"

to account for this. This should improve coverage a bit.
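
As a rough illustration of the difference, the following Python sketch serializes the attributes of the `<svg g>` element from the example both ways. The `Attr` tuple and the two function names are hypothetical and are not part of any existing harness. The sketch also assumes that the current format sorts by the serialized name using a plain string comparison.

```python
from typing import List, NamedTuple, Optional


class Attr(NamedTuple):
    """Hypothetical attribute record; attributes are kept in insertion order."""
    ns_label: Optional[str]  # e.g. "xlink" or "xml"; None when there is no namespace
    prefix: Optional[str]    # namespace prefix, or None
    local_name: str
    value: str


def serialize_current(attrs: List[Attr], indent: int) -> List[str]:
    """Current format: '<namespace label> <local name>', sorted by that name."""
    pairs = []
    for a in attrs:
        name = f"{a.ns_label} {a.local_name}" if a.ns_label else a.local_name
        pairs.append((name, a.value))
    pairs.sort(key=lambda pair: pair[0])  # assumption: plain string sort
    return [f'| {" " * indent}{name}="{value}"' for name, value in pairs]


def serialize_proposed(attrs: List[Attr], indent: int) -> List[str]:
    """Proposed format: insertion order, '<namespace label> <qualified name>'."""
    lines = []
    for a in attrs:
        qualified = f"{a.prefix}:{a.local_name}" if a.prefix else a.local_name
        name = f"{a.ns_label} {qualified}" if a.ns_label else qualified
        lines.append(f'| {" " * indent}{name}="{a.value}"')
    return lines


# Attributes of <g xml:lang=en xlink:href=foo>, in source (insertion) order.
g_attrs = [
    Attr("xml", "xml", "lang", "en"),
    Attr("xlink", "xlink", "href", "foo"),
]

print("\n".join(serialize_current(g_attrs, 8)))
print("\n".join(serialize_proposed(g_attrs, 8)))
```

With these inputs the first call reproduces the two attribute lines shown for today's format, and the second reproduces the proposed ones.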

gsnedders commented 1 year ago

Per https://github.com/html5lib/html5lib-tests/issues/127#issuecomment-636665907, @hsivonen said:

Since the non-browser test harnesses for the Validator.nu HTML Parser use the present input formats and are more sensitive to format changes than, as I understand it, the html5lib harness, I'd prefer to avoid format changes and I'd like to keep the non-scripted tree construction tests clearly separate from the scripted ones.

I'm not totally sure what that was specifically about; in principle we've had a format change in https://github.com/html5lib/html5lib-tests/commit/e1f5573bdf53ad80340babde370bc40296eefa12 which means the scripted/not-scripted distinction is normatively within the test.

That said, I think it is fair to say that we should be relatively conservative about making format changes: each one incurs work for quite a lot of people, so we might want to discuss whether there are other format changes we should make at the same time.

hsivonen commented 1 year ago

The Validator.nu harness has been pretty sensitive to the order of the hash-prefixed sections, since the Validator.nu harness reads the test files as a stream instead of treating them as one big random-access thing.

Changes in serialization (like proposed here) are easier to deal with than having the hash-prefixed sections in variable order.
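
To make the streaming constraint concrete, here is a minimal Python sketch of a reader that consumes a `.dat` file line by line and never looks back. It is not the Validator.nu harness. The names `SECTION_ORDER`, `iter_tests`, and `_finish` are hypothetical, and the section list is an assumption about the expected order. It also assumes that no content line is exactly equal to a section header.

```python
import io
from typing import Dict, Iterable, Iterator, List

# Assumed order of hash-prefixed sections; optional ones may be absent.
SECTION_ORDER = [
    "#data", "#errors", "#new-errors", "#document-fragment",
    "#script-off", "#script-on", "#document",
]


def _finish(test: Dict[str, List[str]], last_section: str) -> Dict[str, List[str]]:
    """Drop the single blank line that separates one test from the next."""
    body = test[last_section]
    if body and body[-1] == "":
        body.pop()
    return test


def iter_tests(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per test, reading strictly forward through the input."""
    rank = {name: i for i, name in enumerate(SECTION_ORDER)}
    test, current, last_rank = None, None, -1
    for raw in lines:
        line = raw.rstrip("\n")
        if line in rank:
            if line == "#data":
                if test is not None:
                    yield _finish(test, current)
                test, last_rank = {}, -1
            elif test is None:
                raise ValueError(f"{line} before any #data section")
            if rank[line] <= last_rank:
                # A forward-only reader cannot recover from this.
                raise ValueError(f"section {line} is out of order")
            last_rank = rank[line]
            current = line
            test[current] = []
        elif test is not None:
            test[current].append(line)
    if test is not None:
        yield _finish(test, current)


sample = "#data\n<p>\n#errors\n#document\n| <html>\n\n#data\nx\n#errors\n#document\n| <html>\n"
for parsed in iter_tests(io.StringIO(sample)):
    print(parsed["#data"], parsed["#document"])
```

A change to how a line inside `#document` is written only affects the comparison step, whereas a reordered section trips the reader itself.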

gsnedders commented 1 year ago

> Changes in serialization (like proposed here) are easier to deal with than having the hash-prefixed sections in variable order.

FWIW, this is another thing I was trying to sort out in #83 years ago, adding linting to assert the order is what it's meant to be.
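
A lint of that kind could look roughly like the Python sketch below. It is not the code from #83. The function name `lint_section_order` and the `DEFAULT_ORDER` tuple are hypothetical, and the order itself is an assumption that callers can override.

```python
from typing import List, Sequence, Tuple

# Assumed section order; pass a different sequence if the format differs.
DEFAULT_ORDER = (
    "#data", "#errors", "#new-errors", "#document-fragment",
    "#script-off", "#script-on", "#document",
)


def lint_section_order(
    text: str, order: Sequence[str] = DEFAULT_ORDER
) -> List[Tuple[int, str]]:
    """Return (line number, message) pairs for sections that are out of order."""
    rank = {name: i for i, name in enumerate(order)}
    problems = []
    last = None  # most recent section header seen in the current test
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line not in rank:
            continue  # ordinary content line
        if line == order[0]:
            last = None  # the first section starts a new test
        elif last is None:
            problems.append((lineno, f"{line} appears before {order[0]}"))
        elif rank[line] <= rank[last]:
            problems.append((lineno, f"{line} appears after {last}"))
        last = line
    return problems


print(lint_section_order("#data\nx\n#document\n| y\n#errors\n"))
```

Unlike a streaming reader, the lint collects every violation with its line number instead of stopping at the first one.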

not-my-profile commented 1 year ago

How does serializing attributes in insertion order improve test coverage? I'd think that the order doesn't matter.

annevk commented 1 year ago

The order matters.

not-my-profile commented 1 year ago

How does it matter?

annevk commented 1 year ago

I'm not sure what you mean. Element attributes are defined to be kept in insertion order. This is observable through various APIs.
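
As an illustration of what "observable" means here, the toy Python model below mimics an element's attribute list. It is not a real DOM implementation. The `Element` class and its method names are hypothetical stand-ins for DOM calls such as `setAttribute()`, `removeAttribute()`, and `getAttributeNames()`.

```python
from typing import Dict, List


class Element:
    """Toy model of an element's attribute list (not a real DOM)."""

    def __init__(self) -> None:
        # Python dicts preserve insertion order (3.7+).
        self._attrs: Dict[str, str] = {}

    def set_attribute(self, name: str, value: str) -> None:
        # A new name is appended; an existing name keeps its position.
        self._attrs[name] = value

    def remove_attribute(self, name: str) -> None:
        self._attrs.pop(name, None)

    def attribute_names(self) -> List[str]:
        return list(self._attrs)


g = Element()
g.set_attribute("xml:lang", "en")
g.set_attribute("xlink:href", "foo")
print(g.attribute_names())  # source order

g.set_attribute("xml:lang", "fr")  # changing a value does not move the attribute
print(g.attribute_names())

g.remove_attribute("xml:lang")
g.set_attribute("xml:lang", "en")  # removing and re-adding moves it to the end
print(g.attribute_names())
```

A sorted serialization would print the same result for both orders, so a parser that stored attributes in the wrong order would still pass.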