w3c / reffy

Reffy is a Web spec crawler and analyzer tool. It is notably used to update Webref

Extract normative algorithms defined in specs #1614

Closed tidoust closed 2 months ago

tidoust commented 2 months ago

This adds a browserlib module that creates extracts filled with information about algorithms defined in the spec. The extracts are rather raw: they just capture the tree structure of algorithms and copy the HTML for each step. This makes the resulting extracts less directly useful but also makes it possible to run all sorts of analyses on them (see related PR in Strudy: https://github.com/w3c/strudy/pull/645 and the current results of running the analysis).

This would add about 50MB of additional data to a Webref crawl result. That's significant, roughly equivalent to the IDs extracts, which are the heaviest for now.

The structure of the extracts is very likely going to change substantively as we learn from experience!

There are plenty of things that could be improved. The code contains TODOs for the main ones.

An algorithm extract is essentially an object with the following keys:

Each step is essentially an object that follows the same structure as an algorithm, except that it does not have a name and href keys, and may also have the following keys:
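
To make the shape concrete, here is a minimal sketch of what such an extract might look like and how an analysis could walk it. Only the `name` and `href` keys are named in the description above; the `html` and `steps` key names, the sample content, and the helper functions are assumptions made for illustration, not the actual extract schema.

```javascript
// Hypothetical algorithm extract. Only `name` and `href` come from the
// description above; `html` and `steps` are assumed key names.
const algorithm = {
  name: "do something",
  href: "https://example.org/spec/#do-something",
  steps: [
    { html: "Let <var>x</var> be 1." },
    {
      html: "If <var>x</var> is 1, then:",
      steps: [
        { html: "Set <var>x</var> to 2." },
        { html: "Return <var>x</var>." }
      ]
    }
  ]
};

// Depth-first walk over the tree of steps, calling `visit` with each step
// and its nesting depth (0 for top-level steps).
function walkSteps(node, visit, depth = 0) {
  for (const step of node.steps ?? []) {
    visit(step, depth);
    walkSteps(step, visit, depth + 1);
  }
}

// Count the steps and record the deepest nesting level reached.
function summarize(algo) {
  let count = 0;
  let maxDepth = 0;
  walkSteps(algo, (step, depth) => {
    count += 1;
    maxDepth = Math.max(maxDepth, depth);
  });
  return { count, maxDepth };
}
```

Because steps reuse the structure of the algorithm itself, the same recursive walk works at every level of the tree.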

dontcallmedom commented 2 months ago

https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#duration-time-component is extracted as an algorithm, which it isn't really; not sure yet what we can do about it

tidoust commented 2 months ago

> https://html.spec.whatwg.org/multipage/common-microsyntaxes.html#duration-time-component is extracted as an algorithm, which it isn't really; not sure yet what we can do about it

That's precisely the sort of step that was ignored when the code only checked the first step in a list. But extraction then missed a number of real algorithms as a result. There's room for improvement, for sure.
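
The trade-off can be illustrated with a small sketch. It assumes, hypothetically, that a list is treated as an algorithm when a step starts with a typical operation verb. The verb list, the function names, and the sample steps are made up for illustration and are not Reffy's actual detection logic.

```javascript
// Hypothetical verbs that typically open an algorithm step.
const OPERATION = /^(let|set|if|return|throw|assert|for each|while|run|append)\b/i;

// Stricter check: only look at the first step of the list.
function looksLikeAlgorithmFirstStep(steps) {
  return steps.length > 0 && OPERATION.test(steps[0].trim());
}

// Looser check: accept the list as soon as any step looks like an operation.
function looksLikeAlgorithmAnyStep(steps) {
  return steps.some(step => OPERATION.test(step.trim()));
}

// Made-up real algorithm whose first step does not start with a verb.
const realAlgorithm = [
  "The user agent may run these steps in parallel.",
  "Let result be an empty list.",
  "Return result."
];

// Made-up list of string components, which is arguably not an algorithm
// even though one item starts with "If".
const componentList = [
  "Zero or more whitespace characters.",
  "One or more digits, representing a number of time units.",
  "If the scale is 1, then optionally a full stop followed by digits."
];
```

With these samples, the first-step check rejects both lists, so it misses the real algorithm, while the any-step check accepts both, so it also picks up the component list.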

tidoust commented 2 months ago

I mentioned updating the README, but that's more for Webref in practice. Now, a couple of additional things to do before merge:

  1. src/browserlib/reffy.json needs to be completed with the new module, otherwise it won't run by default. I'll do it.
  2. I noticed that the crawl crashes partially on a few specs (including Web Audio API, WebXR, and Web Authentication), falling back to a regular network request instead of reusing the information cache. Error is "Body is unusable". I need to look into that.

tidoust commented 2 months ago

> 1. `src/browserlib/reffy.json` needs to be completed with the new module, otherwise it won't run by default. I'll do it.

Done.

> 2. I noticed that the crawl crashes partially on a few specs (including Web Audio API, WebXR, and Web Authentication), falling back to a regular network request instead of reusing the information cache. Error is "Body is unusable". I need to look into that.

Well, that no longer seems to happen. Now, I don't understand how that error could appear in the first place (plus, it seemed reproducible), or what change made the error disappear.
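
For reference, one well-known way to get this error with the `fetch` implementation in Node.js is to read the body of the same `Response` twice. The sketch below reproduces that case and shows `clone()` as a way around it. Whether this is what happened in the crawl is not established here, and the `readTwice` and `readTwiceWithClone` helper names are made up for illustration.

```javascript
// Reading a Response body twice: the second read rejects with a TypeError
// (in Node.js, the message mentions "Body is unusable").
async function readTwice() {
  const response = new Response("cached content");
  await response.text();
  try {
    await response.text();
    return null; // not expected: the second read should fail
  } catch (err) {
    return err;
  }
}

// Cloning before the first read keeps an independent, readable copy.
async function readTwiceWithClone() {
  const response = new Response("cached content");
  const copy = response.clone();
  const first = await response.text();
  const second = await copy.text();
  return [first, second];
}
```

A cache that hands out the same `Response` object to several consumers would need something like `clone()`, or would need to store the body text instead of the response.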

dontcallmedom commented 2 months ago

feel free to merge if you're satisfied with my changes to address your comments

tidoust commented 2 months ago

I'm running a crawl locally to review extracts one more time before merge.

tidoust commented 2 months ago

I made a few adjustments, with related tests. The single-step algorithms seem to get correctly extracted without creating duplicates now (with the CSP exception, which I'm happy to live with for now). Good to merge?