Extract normative algorithms defined in specs

tidoust commented 2 months ago

This adds a browserlib module that creates extracts filled with information about algorithms defined in the spec. The extracts are rather raw: they just capture the tree structure of algorithms and copy the HTML for each step. This makes the resulting extracts less directly useful but also makes it possible to run all sorts of analyses on them (see related PR in Strudy: https://github.com/w3c/strudy/pull/645 and the current results of running the analysis).

This would add about 50MB of additional data to a Webref crawl result. That's significant, roughly equivalent to the IDs extracts, which are the heaviest for now.

The structure of the extracts is very likely going to change substantively as we learn from experience!

There are plenty of things that could be improved. The code contains TODOs for the main ones.

An algorithm extract is essentially an object with the following keys:

- `name`: The name of the algorithm, when one exists.
- `href`: The URL, with fragment, to reach the algorithm, when one exists.
- `html`: Some introductory prose for the algorithm. That prose may well contain actual algorithmic operations, e.g.: "When invoked, run the following steps in parallel". `href`/`src` attributes in the HTML have absolute URLs.
- `steps`: Atomic algorithm steps.
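
To make the shape concrete, here is a minimal sketch in Python of what one extract could look like, based only on the four keys listed above. The algorithm name, URL and HTML are invented for illustration and do not come from an actual crawl, and `unexpected_keys` is a hypothetical helper, not something Reffy provides.

```python
import json

# Hypothetical extract: every value below is made up for illustration.
extract = {
    "name": "do something",
    "href": "https://example.org/spec/#do-something",
    "html": "To <dfn>do something</dfn>, run the following steps in parallel:",
    "steps": [
        {"html": "Let <var>x</var> be 1."},
        {"html": "Return <var>x</var>."},
    ],
}

# Keys described for an algorithm extract in this PR.
ALGORITHM_KEYS = {"name", "href", "html", "steps"}


def unexpected_keys(algorithm):
    """Return the keys of an extract that are not among the documented ones."""
    return sorted(set(algorithm) - ALGORITHM_KEYS)


print(json.dumps(extract, indent=2))
print(unexpected_keys(extract))
```

Since `name` and `href` only appear "when one exists", a consumer should probably read them with `extract.get("name")` rather than assume they are present.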

Each step is essentially an object that follows the same structure as an algorithm, except that it does not have `name` and `href` keys, and may also have the following keys:

- `operation`: Gives the name of the main operation performed by the step, when one was identified. So far, that's only for "switch".
- `case`: Used in switch steps to identify the switch condition that triggers the step.
- `ignored`: Ordered lists found at the step level that do not look like algorithm steps. Or maybe they are? The lists should get reviewed: they usually describe inputs/outputs or conditions, but they may signal parts where the extraction logic needs to be improved. The lists are reported as text prose.
- `additional`: Each step should contain one and only one algorithm. When other algorithms are found at the same level, they get reported in that property. That usually signals that the spec could be improved because it fails to use different list items for different steps, and/or that the extraction logic needs to be smarter.
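
As an illustration of the kind of analysis this raw tree structure allows, the following Python sketch walks the nested steps and lists those that carry an `ignored` or `additional` key, i.e. the ones worth reviewing. It assumes that nested steps (including switch cases) live under `steps` and that `ignored` is a list of strings; the sample data and the `walk` and `needs_review` helpers are hypothetical, not part of Reffy or Strudy.

```python
def walk(node, depth=0):
    """Yield (depth, step) for every step nested under an algorithm or step."""
    for step in node.get("steps", []):
        yield depth, step
        yield from walk(step, depth + 1)


def needs_review(algorithm):
    """Return the HTML of steps that carry an `ignored` or `additional` key."""
    return [
        step.get("html", "")
        for _, step in walk(algorithm)
        if "ignored" in step or "additional" in step
    ]


# Made-up algorithm with a switch step; values are for illustration only.
algorithm = {
    "name": "process an item",
    "html": "To process an item:",
    "steps": [
        {"html": "Let <var>kind</var> be the item's kind."},
        {
            "html": "Switch on <var>kind</var>:",
            "operation": "switch",
            "steps": [
                {"case": "text", "html": "Return the item's text."},
                {
                    "case": "anything else",
                    "html": "Return failure.",
                    # Assumed shape: ordered lists reported as text prose.
                    "ignored": ["1. The item. 2. The result."],
                },
            ],
        },
    ],
}

# Print an indented outline, labelling each step by operation or case.
for depth, step in walk(algorithm):
    print("  " * depth + step.get("operation", step.get("case", "step")))
print(needs_review(algorithm))
```

Because steps reuse the algorithm structure, a single recursive generator is enough to cover both levels.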

dontcallmedom commented 2 months ago

https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#duration-time-component is extracted as an algorithm, which it isn't really; not sure yet what we can do about it

tidoust commented 2 months ago

> https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#duration-time-component is extracted as an algorithm, which it isn't really; not sure yet what we can do about it

That's precisely the sort of steps that were ignored when the code was only checking the first step in a list. But then extraction missed a number of real algorithms as a result. There's room for improvement, for sure.

tidoust commented 2 months ago

I mentioned updating the README, but that's more for Webref in practice. Now, a couple of additional things to do before merge:

1. `src/browserlib/reffy.json` needs to be completed with the new module, otherwise it won't run by default. I'll do it.
2. I noticed that the crawl crashes partially on a few specs (including Web Audio API, WebXR, and Web Authentication), falling back to a regular network request instead of reusing the information cache. Error is "Body is unusable". I need to look into that.

tidoust commented 2 months ago

> 1. `src/browserlib/reffy.json` needs to be completed with the new module, otherwise it won't run by default. I'll do it.

Done.

> 2. I noticed that the crawl crashes partially on a few specs (including Web Audio API, WebXR, and Web Authentication), falling back to a regular network request instead of reusing the information cache. Error is "Body is unusable". I need to look into that.

Well, that no longer seems to happen. Now, I don't understand how that error could appear in the first place (plus, it seemed reproducible), or what change made the error disappear.

dontcallmedom commented 2 months ago

feel free to merge if you're satisfied with my changes to address your comments

tidoust commented 2 months ago

I'm running a crawl locally to review extracts one more time before merge.

tidoust commented 2 months ago

I made a few adjustments, with related tests. The single-step algorithms seem to get correctly extracted without creating duplicates now (with the CSP exception, which I'm happy to live with for now). Good to merge?

w3c / reffy

Extract normative algorithms defined in specs #1614