w3c / webref

Machine-readable references of terms defined in web browser specifications
https://w3c.github.io/webref/
MIT License

Near real-time updates to crawled data #486

Open dontcallmedom opened 2 years ago

dontcallmedom commented 2 years ago

In a variety of contexts (CI in particular, but likely also in the context of the data re-used by spec authoring tools), it would be ideal if the content in webref reflected changes in the underlying documents in close to real-time.

One way we could enable this (at least partially) is by having spec repos trigger a webref update for a given spec whenever that spec's main source file is updated. This could typically be achieved with a webhook installed at the repo level or (more likely, for scaling) at the org level.
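
To make the webhook idea a bit more concrete, here is a rough sketch of what a receiver might do with a GitHub `push` event: check whether the main source file was touched, and if so prepare a `repository_dispatch` call against webref. The `index.bs` default, the `spec-updated` event name and the `client_payload` shape are all assumptions made for illustration; nothing like this exists in webref today. The request is only built, not sent.

```python
import json
import urllib.request

# GitHub REST endpoint for repository_dispatch events on w3c/webref.
WEBREF_DISPATCH_URL = "https://api.github.com/repos/w3c/webref/dispatches"


def main_source_changed(push_payload, main_file="index.bs"):
    """Return True if any commit in a push event added or modified main_file.

    main_file is an assumption: each spec repo would need to declare its own.
    """
    for commit in push_payload.get("commits", []):
        touched = commit.get("added", []) + commit.get("modified", [])
        if main_file in touched:
            return True
    return False


def build_dispatch_request(spec_url, token):
    """Build (but do not send) a repository_dispatch request for one spec."""
    body = json.dumps({
        "event_type": "spec-updated",          # hypothetical event name
        "client_payload": {"spec": spec_url},  # hypothetical payload shape
    }).encode("utf-8")
    return urllib.request.Request(
        WEBREF_DISPATCH_URL,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

A receiver would call `main_source_changed(payload)` on each incoming push event and only send the dispatch request (for instance with `urllib.request.urlopen`) when it returns `True`.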

One issue is that if several updates are processed at the same time, they would likely trigger an error when pushing the results; this could be avoided either by changing the timing of how checkouts and crawls are organized, or by doing a full crawl (with HTTP caching optimizations to reduce the time and network impact).

dontcallmedom commented 2 years ago

So it looks like solving https://github.com/w3c/reffy/issues/850 will get us to ~1min30 as a baseline for a no-update workflow run, and updating one spec is probably on the order of ~10s, so running a full crawl might be a reasonable approach to this, although we should expect the baseline to grow in proportion to the number of specs being crawled.
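
As a back-of-the-envelope model of those numbers (a sketch only: the ~90s baseline and ~10s per updated spec are the rough estimates above, and linear scaling of the baseline with the number of specs is an assumption):

```python
def estimated_run_seconds(updated_specs, base_seconds=90.0, per_spec_seconds=10.0):
    """Rough duration of a cached full crawl in which `updated_specs` specs changed."""
    return base_seconds + per_spec_seconds * updated_specs


def scaled_base_seconds(spec_count, reference_count, base_seconds=90.0):
    """No-update baseline if it grows linearly with the number of crawled specs.

    reference_count is the (hypothetical) number of specs crawled when the
    ~90s baseline was measured.
    """
    return base_seconds * spec_count / reference_count
```

With these defaults, a run where a single spec changed comes out at about 100s, and doubling the number of crawled specs doubles the no-update baseline to about 180s.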

For the more efficient single-spec update approach, we might be able to use https://github.com/softprops/turnstyle as a way to ensure trigger events are processed sequentially - see also https://github.community/t/race-condition-possible-from-rapidly-executed-concurrent-github-actions/137411/3
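
To illustrate why overlapping single-spec updates fail at push time and why processing them one after the other avoids it, here is a toy model. It does not use turnstyle or git: `ToyRemote` is a made-up stand-in for the webref branch that rejects a push built on a stale head, and a `threading.Lock` plays the role of "wait for the previous run to finish".

```python
import threading


class ToyRemote:
    """Hypothetical stand-in for the webref branch: rejects stale pushes."""

    def __init__(self):
        self.head = 0   # revision counter
        self.data = {}  # spec name -> extract

    def checkout(self):
        """Return the current revision and a private copy of the data."""
        return self.head, dict(self.data)

    def push(self, base_head, data):
        """Accept the push only if it was built on the current head."""
        if base_head != self.head:
            return False  # comparable to a non-fast-forward rejection
        self.head += 1
        self.data = data
        return True


def overlapping_updates(remote):
    """Two updates check out the same head; the second push is rejected."""
    base_a, data_a = remote.checkout()
    base_b, data_b = remote.checkout()
    data_a["spec-a"] = "new extract"
    data_b["spec-b"] = "new extract"
    return remote.push(base_a, data_a), remote.push(base_b, data_b)


# One lock shared by all updates: checkout, crawl and push happen as a unit.
run_lock = threading.Lock()


def serialized_update(remote, spec, extract):
    """Run checkout + update + push while holding the lock."""
    with run_lock:
        base, data = remote.checkout()
        data[spec] = extract
        return remote.push(base, data)
```

`overlapping_updates(ToyRemote())` returns `(True, False)`, whereas any number of threads calling `serialized_update` on the same remote all get their push accepted, at the cost of waiting for each other.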