mysociety / parlparse

The scraper/parser that produces data for TheyWorkForYou, PublicWhip, etc

London assembly scraper #167

Open ajparsons opened 1 year ago

ajparsons commented 1 year ago

This PR replaces the previous scraper to address the change in the London Mayor/Assembly website (see https://github.com/mysociety/theyworkforyou/issues/1687).

It also adds some config files for Docker and code linters. The linters are restricted to the london-mayors-question folder for the moment.

The scraper talks to the London site in two places:

Because we have no way of knowing which questions have answers, all questions without answers need to be re-queried for an update.

The command to do this looks something like this:

questions.py fetch-unknown-questions --last-week fetch-unstored refresh-unanswered build-xml --outdir temp/

A version of this has replaced the commented-out lines in updatedaterange-parse.
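The refresh step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `refresh_unanswered`, the one-file-per-question cache layout, the `id` and `answer` keys, and the `fetch` callable are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable, List, Optional


def refresh_unanswered(
    cache_dir: Path, fetch: Callable[[str], Optional[dict]]
) -> List[str]:
    """Re-query every cached question that has no answer yet.

    Hypothetical layout: one JSON file per question, each holding
    an "id" and an "answer" key. Returns the ids that were updated.
    """
    updated = []
    for path in sorted(cache_dir.glob("*.json")):
        question = json.loads(path.read_text(encoding="utf-8"))
        if question.get("answer"):
            continue  # already answered, nothing to refresh
        # We cannot tell in advance whether an answer now exists,
        # so every unanswered question has to be fetched again.
        fresh = fetch(question["id"])
        if fresh and fresh.get("answer"):
            path.write_text(json.dumps(fresh, indent=2), encoding="utf-8")
            updated.append(question["id"])
    return updated
```

Passing the fetcher in as a callable keeps the sketch independent of how the site is actually queried.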

The scraper stores intermediate files in a json_cache directory. An initial population will need to be done to catch up:

questions.py fetch-unknown-questions 2020-12-20
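As a rough sketch of how the two invocations differ, the start of the fetch window could be resolved along these lines. The helper `resolve_start_date` is hypothetical, and reading `--last-week` as "seven days before today" is an assumption rather than something confirmed by the scraper's code.

```python
from datetime import date, timedelta
from typing import Optional


def resolve_start_date(
    start: Optional[str] = None,
    last_week: bool = False,
    today: Optional[date] = None,
) -> date:
    """Return the first date to fetch questions from.

    Assumption: --last-week means seven days before today; an explicit
    start date is given in ISO format, e.g. "2020-12-20".
    """
    today = today or date.today()
    if last_week:
        return today - timedelta(days=7)
    if start is None:
        raise ValueError("either a start date or last_week is required")
    return date.fromisoformat(start)
```

For the catch-up run, `resolve_start_date("2020-12-20")` would give the date from which the initial population starts.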

There have been some updates to the overall requirements.txt, which hopefully shouldn't cause wider problems.

Running the import for all info since 2020-12-20 seems to work fine in TWFY.