schemedoc / srfi-metadata

Import SRFI metadata into the Scheme API
https://docs.scheme.org/srfi/support/
MIT License

Figure out a full-featured scraping framework #11

Open lassik opened 4 years ago

lassik commented 4 years ago

Currently we use a hand-written scraper for these implementations:

Would be nice to extend the scraper generator in listings.scm so it can handle all of the above.

The scraper generator currently writes Unix shell scripts that use curl, tar and grep, but that's just an implementation detail. If Scheme had a more fully developed archive framework, we could just as well skip the shell scripts and run the scrapers directly in Scheme.

The main thing is to discover a specification language for the scrapers.
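
As a rough illustration of what such a specification language could look like, here is a minimal Python sketch that renders a declarative scraper description into a curl, tar and grep pipeline. It is not the format used in listings.scm: the `ScraperSpec` fields, the `to_shell` function, the placeholder URL and the file paths are all assumptions made up for this example.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class ScraperSpec:
    """Hypothetical declarative description of one scraper."""
    name: str         # implementation name, used only for the output file name
    archive_url: str  # tarball to download (a placeholder URL in the example)
    member: str       # path of the file to read inside the tarball
    pattern: str      # extended regexp that matches SRFI mentions


def to_shell(spec: ScraperSpec) -> str:
    """Render the spec as a single curl | tar | grep pipeline."""
    q = shlex.quote  # quote every value so it is safe inside a shell command
    return (
        f"curl --fail --silent --show-error --location {q(spec.archive_url)}"
        f" | tar -xzf - -O {q(spec.member)}"
        f" | grep -oE {q(spec.pattern)}"
        f" > {q(spec.name + '.txt')}"
    )


if __name__ == "__main__":
    # Placeholder values, not a real implementation's archive.
    example = ScraperSpec(
        name="example-impl",
        archive_url="https://example.org/impl.tar.gz",
        member="impl/doc/srfi.txt",
        pattern="srfi-[0-9]+",
    )
    print(to_shell(example))
```

Keeping the spec as plain data means the same description could later drive a scraper that runs directly in Scheme, with the shell output as just one possible backend.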

lassik commented 4 years ago

The current definitions look like this: https://github.com/schemedoc/srfi-metadata/blob/b1f03fbdc88e61829f47995a9387cee407dfa9cb/listings.scm#L17

lassik commented 4 years ago

IMHO it's a good idea to scrape the listings from manuals whenever possible. Manuals usually describe the most "official" features of the implementation - they tend to put emphasis on polished features, not experimental ones. So if a SRFI is mentioned in the manual, support is probably reasonably complete.

jpellegrini commented 4 years ago

> it's a good idea to scrape the listings from manuals whenever possible

Some of the manuals are generated programmatically (STklos is one example). Maybe (I don't know) this could be used somehow.

lassik commented 4 years ago

Yes - in that case it doesn't matter whether we scrape the source or result. Whichever is easier to do. For STklos we currently scrape doc/skb/srfi.stk with one regexp.
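
To illustrate the "one regexp" approach in general terms, here is a small Python sketch that pulls SRFI numbers out of arbitrary text. The pattern and the `srfi_numbers` helper are assumptions for this example; they are not the regexp the project actually runs against doc/skb/srfi.stk.

```python
import re
from typing import List

# Assumed pattern: "srfi" followed by an optional hyphen or whitespace and a
# number, matched case-insensitively, e.g. "srfi-1", "SRFI 13", "(srfi 9)".
SRFI_MENTION = re.compile(r"\bsrfi[-\s]?(\d+)\b", re.IGNORECASE)


def srfi_numbers(text: str) -> List[int]:
    """Return the distinct SRFI numbers mentioned in text, in ascending order."""
    return sorted({int(m.group(1)) for m in SRFI_MENTION.finditer(text)})


if __name__ == "__main__":
    sample = "Supports (srfi 1), SRFI-13 and srfi-1 again."
    print(srfi_numbers(sample))
```

Because the pattern only looks at the text, it works the same on a manual's source and on its generated output.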

erkin commented 3 years ago

I added GitLab support to the scraper generator to integrate Loko and Kawa. I also added support for Snow Fort and chez-srfi, which doesn't depend on Unix shell scripts but depends on external Racket packages. Maybe a compromise can be made.

For Racket, we can scrape racket/srfi repo.
Guile and MIT/GNU Scheme are both hosted on Savannah with cgit, so maybe we can add a third Git host to the scraper generator.
TinyScheme, Ypsilon, Chez, Scheme48, SLIB and Larceny don't seem like they're going to add new SRFIs any time soon, so we can even remove the existing handwritten scrapers and keep static data.

lassik commented 3 years ago

Fantastic. Thank you very much for continuing to work on this!

erkin commented 3 years ago

No problem! (I had to neglect it for a while though. ^^;)

I added Racket to the scraper generator (although it's not yet complete, see racket/srfi#10). Now only CHICKEN, Guile and MIT/GNU Scheme remain.

erkin commented 3 years ago

CHICKEN down with d25c7b7! Thanks to the work of @diamond-lizard.

lassik commented 3 years ago

> I added Racket to the scraper generator (although it's not yet complete, see racket/srfi#10). Now only CHICKEN, Guile and MIT/GNU Scheme remain.

Excellent work!