simonw / scrape-open-data

Scrape various open data directories to create an index of what's available out there
https://open-data.datasette.io

Run scraper with --stats once per week #2

Closed · simonw closed this 2 years ago

simonw commented 2 years ago

This is so I don't get lots of tiny diffs because of page view and download counts incrementing all the time.

I built the script with this in mind: it only writes the stats information out, as separate files, if you include --stats. See https://github.com/simonw/scrape-open-data/blob/626c4cbe62ddcc4c88a57f56c69f3b6173b50d3d/scrape_socrata.py#L28-L31
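
A minimal sketch of that idea, assuming argparse (the real scrape_socrata.py may parse its arguments differently; the write_stats helper and the file layout here are hypothetical):

```python
import argparse
import json
from pathlib import Path


def write_stats(datasets):
    # Hypothetical: keep the frequently-changing counters in their own files,
    # so they only show up in diffs on runs that asked for them.
    out_dir = Path("stats")
    out_dir.mkdir(exist_ok=True)
    for dataset in datasets:
        stats = {
            "page_views": dataset.get("page_views"),
            "downloads": dataset.get("downloads"),
        }
        (out_dir / f"{dataset['id']}.json").write_text(json.dumps(stats, indent=2))


def scrape(include_stats):
    datasets = []  # ... fetch and write the core catalog metadata here ...
    if include_stats:
        write_stats(datasets)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--stats",
        action="store_true",
        help="also write the separate, frequently-changing stats files",
    )
    args = parser.parse_args()
    scrape(include_stats=args.stats)
```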

simonw commented 2 years ago

I can use this pattern: https://til.simonwillison.net/github-actions/different-steps-on-a-schedule
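
That TIL describes defining more than one cron schedule and then gating individual steps on `github.event.schedule`, which contains the cron expression that fired the run. A rough sketch of how it could apply here (cron times, job and step names are made up, not copied from the actual workflow):

```yaml
on:
  workflow_dispatch:
  schedule:
    - cron: "0 6 * * *"   # regular scrape, no stats
    - cron: "0 6 * * 1"   # weekly scrape that also writes the stats files

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # github.event.schedule holds the cron string that triggered the run
      - name: Scrape without stats
        if: github.event_name == 'schedule' && github.event.schedule != '0 6 * * 1'
        run: python scrape_socrata.py
      - name: Scrape with stats
        if: github.event.schedule == '0 6 * * 1' || github.event_name == 'workflow_dispatch'
        run: python scrape_socrata.py --stats
      # (the real workflow also commits the scraped files back; omitted here)
```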

simonw commented 2 years ago

Note that with this change the action no longer scrapes on every commit; it only scrapes on workflow_dispatch or when the schedules trigger.
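
Spelled out, the trigger block is now limited to manual and scheduled runs, with nothing listed for pushes (same caveat as above: the cron strings are illustrative):

```yaml
on:
  workflow_dispatch:
  schedule:
    - cron: "0 6 * * *"
    - cron: "0 6 * * 1"
  # no push trigger, so a commit on its own no longer starts a scrape
```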

simonw commented 2 years ago

Running it now with workflow_dispatch, which should populate the stats files for the first time.

simonw commented 2 years ago

Yup, that added the stats files: https://github.com/simonw/scrape-open-data/commit/1a09c87640cddd324031e942e30d9e89f47e51e9

simonw commented 2 years ago

I manually ran it again to check that I got some diffs, and I did: https://github.com/simonw/scrape-open-data/commit/2060f3840c342c98ce80a15a9f54fb93b33e1bc6#diff-6835345cbfec8fbf1dfeaee6534859a57591cf163fae958e06defcc40f87b969