Closed lsg551 closed 6 months ago
Article on how to do this: https://github.com/swyxio/gh-action-data-scraping?tab=readme-ov-file
Maybe add
--emit-tracking-file
which creates or appends to a file for tracking changes in automated scraping processes. This adds another feature: tracking the number of newly added parishes. The file could look like
```
date,version,files scraped
1714910198,v0.4.2,8442
```
with
- `date` being a Unix timestamp
- `version` being the version of matricula-online-scraper that was used
- `files scraped` being the number of rows in the scraped data
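A minimal sketch of what such a helper might look like, assuming a CSV tracking file with the columns proposed above (the function name `emit_tracking_file` and its signature are hypothetical, not part of matricula-online-scraper):

```python
import csv
import time
from pathlib import Path

def emit_tracking_file(path: str, version: str, files_scraped: int) -> None:
    """Create or append to a CSV tracking file.

    Hypothetical helper: writes the header only when the file does not
    exist yet, then appends one row per scraping run.
    """
    p = Path(path)
    is_new = not p.exists()
    with p.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["date", "version", "files scraped"])
        # date is recorded as a Unix timestamp, as in the example above
        writer.writerow([int(time.time()), version, files_scraped])

emit_tracking_file("tracking.csv", "v0.4.2", 8442)
```

Appending rather than overwriting keeps the full history of runs, so the number of newly added parishes between two runs can be computed by diffing consecutive rows.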
Description
Fetching all parishes (> 8000) usually takes Scrapy a few minutes. Also, if many people do this regularly, it puts unnecessary load on Matricula's server, especially since this data mostly remains unchanged and is only updated occasionally with new entries.