mysociety / theyworkforyou

Keeping tabs on the UK's parliaments and assemblies
http://www.theyworkforyou.com/

New london.gov.uk breaks TheyWorkForYou questions scraper #1687

Open ajparsons opened 1 year ago

ajparsons commented 1 year ago

London has a new website: https://www.london.gov.uk/

This breaks the previous scraper we were using to get Mayor's questions. https://www.theyworkforyou.com/london/

The new site doesn't have a page per session like the previous one, so we would need to query date ranges through the search to get the equivalent. https://www.london.gov.uk/who-we-are/what-london-assembly-does/questions-mayor/find-an-answer
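As a rough sketch of the date-range approach (not the actual scraper), a session-sized range could be split into monthly windows and each window turned into a search URL. The query parameter names `date_from` and `date_to` are hypothetical placeholders; the real search form's parameters would need to be checked.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

SEARCH_URL = (
    "https://www.london.gov.uk/who-we-are/what-london-assembly-does/"
    "questions-mayor/find-an-answer"
)


def month_windows(start, end):
    """Yield (first_day, last_day) pairs covering start..end, one per calendar month."""
    current = start
    while current <= end:
        # Work out the first day of the following month.
        if current.month == 12:
            next_month = date(current.year + 1, 1, 1)
        else:
            next_month = date(current.year, current.month + 1, 1)
        # Clip the window so it never runs past the requested end date.
        last = min(next_month - timedelta(days=1), end)
        yield current, last
        current = next_month


def search_url(first, last):
    """Build a search URL for one window.

    ASSUMPTION: "date_from" and "date_to" are made-up parameter names.
    """
    query = urlencode({"date_from": first.isoformat(), "date_to": last.isoformat()})
    return SEARCH_URL + "?" + query
```

Keeping the windows small should keep each result set manageable, at the cost of more requests per run.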

The members feed comes via wikidata and is unaffected.

As a first action, we should contact them, raise awareness of the issue, and see if we can get a nicer data feed to work with rather than writing a new scraper. Assigning myself to keep track.

ajparsons commented 1 year ago

Have sent a message about a data feed.

ajparsons commented 1 year ago

Had a reply, and they've suggested they should be creating an RSS feed:

It’s a good suggestion re adding a feed for this information - while it sounds like this has happened as you were scraping our previous site for this data, our technical team have advised the best solution going forward is that we include Mayor’s Questions as an RSS feed, so that the data is available in a machine readable/accessible format.

Let us know if this will work for you and we can look into building that in over the next month or two. It would be good to understand what would be most useful for you in terms of the info provided.

I think this would work well for us? We'd still have to make a new scraper, but it should be more stable over the long term.

I can have a look at the fields the scraper was originally extracting to pass back to them - do we have any other suggestions about format? e.g. ability to query by day/month?

dracos commented 1 year ago

I guess the main issue is the one we had with their site: if it’s an RSS feed of questions, how do we get the answers? Questions appear before there are answers, e.g. https://www.london.gov.uk/who-we-are/what-london-assembly-does/questions-mayor/find-an-answer/lfb-staff-progression-2 (as opposed to https://www.london.gov.uk/who-we-are/what-london-assembly-does/questions-mayor/find-an-answer/responding-climate-breakdown which has an answer on the page).

If they just translated their search page into RSS, we’d get a feed of links to the questions, then have to store them all and fetch them every day to see whether each one had an answer in the HTML yet. How would we know how far back the RSS feed went, and what it contained? I guess if it accepts parameters, and so is more an API that happens to output RSS, that might be okay (though would we still have to parse the speaker/question/answer out of the RSS? It’s not a very rich output format, after all).

Ideally, each morning we’d want to be able to say “give us an RSS feed for anything that got an answer (or even better, was updated) yesterday” and get back something with all of that in it. What the format is then doesn’t really matter.
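To make the "anything updated yesterday" idea concrete, here is a minimal sketch that filters a standard RSS 2.0 document by each item's `pubDate`. It assumes the eventual feed would set `pubDate` to the time a question was last answered or updated, which is not something london.gov.uk has confirmed. The function name is illustrative only.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def items_updated_on(rss_text, day):
    """Return title/link dicts for RSS items whose pubDate falls on `day`.

    ASSUMPTION: pubDate reflects when the question was last updated.
    """
    root = ET.fromstring(rss_text)
    selected = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        if pub is None:
            # Without a date we cannot tell when the item changed.
            continue
        if parsedate_to_datetime(pub).date() == day:
            selected.append(
                {"title": item.findtext("title"), "link": item.findtext("link")}
            )
    return selected
```

A morning run would call this with yesterday's date. If the feed only carried the question's original publication date, this filter would miss late answers, which is the concern raised above.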

ajparsons commented 1 year ago

I've had a go at the scraper. The big inefficiency is having to requery all non-answered questions. If we merge it, I'll go back to the London Assembly and see if we can still get a feed for that (it would speed us up and mean fewer queries for them).
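For illustration, the requery step might look something like the sketch below. This is not the code in the scraper itself: `recheck_pending` and the `fetch_answer` callable are hypothetical names, and `fetch_answer` stands in for whatever fetches a question page and returns its answer text, or `None` if there is no answer yet.

```python
def recheck_pending(pending, fetch_answer):
    """Re-fetch every unanswered question and split out the ones now answered.

    `pending` is a list of question URLs stored from earlier runs.
    `fetch_answer` is a hypothetical callable: URL -> answer text or None.
    Returns (answered, still_pending).
    """
    answered = {}
    still_pending = []
    for url in pending:
        answer = fetch_answer(url)
        if answer:
            answered[url] = answer
        else:
            # Still unanswered, so it has to be requeried on the next run.
            still_pending.append(url)
    return answered, still_pending
```

Each run costs one request per unanswered question, so the cost grows with the backlog. A feed of recently updated questions would replace that loop with a single request.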