mysociety / parlparse

The scraper/parser that produces data for TheyWorkForYou, PublicWhip, etc

Create divisions data exports from XML #170

Closed ajparsons closed 6 months ago

ajparsons commented 9 months ago

TheyWorkForYou Votes imports processed vote information via a database dump of Public Whip, reworked as parquet files (https://pages.mysociety.org/publicwhip-data/). This isn’t using the calculated tables, but the raw information Public Whip has extracted from the debate XML.

The key tables used are:

We want to be able to create these tables (or something similar) directly from parlparse. This cuts Public Whip out of the loop, and would also let us add votes for all the Parliaments that twfy covers but Public Whip does not.

I think this only needs to be pw_divisions and pw_vote. The schemas for these are at the link above.
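As a rough sketch of what "directly from parlparse" could look like, the snippet below walks a debate XML fragment and builds one list of division rows and one list of vote rows. The element and attribute names in the sample (`division`, `mplist`, `mpname`, `divdate`, `divnumber`, `person_id`) and the output column names are assumptions for illustration, not the confirmed parlparse or pw_divisions/pw_vote schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are assumed for
# illustration and should be checked against real parlparse output.
SAMPLE_XML = """<publicwhip>
  <division id="uk.org.publicwhip/debate/2024-01-10a.100.0"
            divdate="2024-01-10" divnumber="1">
    <mplist vote="aye">
      <mpname person_id="uk.org.publicwhip/person/10001" vote="aye">A Member</mpname>
    </mplist>
    <mplist vote="no">
      <mpname person_id="uk.org.publicwhip/person/10002" vote="no">B Member</mpname>
      <mpname person_id="uk.org.publicwhip/person/10003" vote="no">C Member</mpname>
    </mplist>
  </division>
</publicwhip>"""


def extract_tables(xml_text, chamber="commons"):
    """Return (divisions, votes) as lists of dicts, one dict per row."""
    root = ET.fromstring(xml_text)
    divisions, votes = [], []
    for div in root.iter("division"):
        division_id = div.get("id")
        divisions.append(
            {
                "division_id": division_id,
                "chamber": chamber,
                "division_date": div.get("divdate"),
                "division_number": int(div.get("divnumber")),
            }
        )
        # One vote row per member listed under the division.
        for mp in div.iter("mpname"):
            votes.append(
                {
                    "division_id": division_id,
                    "person_id": mp.get("person_id"),
                    "vote": mp.get("vote"),
                }
            )
    return divisions, votes


divisions, votes = extract_tables(SAMPLE_XML)
```

The two lists map naturally onto two tables, and could be written out in whatever format is settled on.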

Pw_divisions includes division descriptions added in public whip - we will need to do a comparison between the new data source and the old data source to extract ‘custom division name/description’, which can be re-added in twfy-votes (currently there is a big yaml for this).
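The comparison could be as simple as the sketch below: match divisions from both sources on a shared key and keep the Public Whip name wherever it differs from the one derived from the XML. The key `(chamber, date, number)` and the row field names are hypothetical, chosen only for illustration.

```python
def custom_descriptions(old_rows, new_rows):
    """Map division key -> old name, for names that differ from the XML-derived one."""
    new_names = {
        (r["chamber"], r["date"], r["number"]): r["name"] for r in new_rows
    }
    custom = {}
    for r in old_rows:
        key = (r["chamber"], r["date"], r["number"])
        # Only divisions present in both sources can be compared.
        if key in new_names and r["name"].strip() != new_names[key].strip():
            custom[key] = r["name"]
    return custom


# Hypothetical rows: "old" is the Public Whip export, "new" is built from the XML.
old = [
    {"chamber": "commons", "date": "2024-01-10", "number": 1,
     "name": "Example Bill: Decline Second Reading"},
    {"chamber": "commons", "date": "2024-01-10", "number": 2,
     "name": "Example Bill"},
]
new = [
    {"chamber": "commons", "date": "2024-01-10", "number": 1, "name": "Example Bill"},
    {"chamber": "commons", "date": "2024-01-10", "number": 2, "name": "Example Bill"},
]
overrides = custom_descriptions(old, new)
```

The resulting mapping is the part that would need to be carried over into twfy-votes.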

There is duplication between pw_mp and a separate set of tables twfy-votes extracts from the people.json (https://pages.mysociety.org/politician_data/) - so this should be able to be cut out without work in parlparse.

Questions:

Pw_vote contains a reference to a membership_id rather than person_id (meaning votes can be easily mapped to current parties through a join). How does this handle membership changes that are added after the fact (party changes etc.)? Is pw rebuilding a lot of this table, rather than just adding new votes, to catch this kind of thing?
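One possible alternative, sketched below under assumed field names, is to store person_id on each vote and resolve the membership at query time from the vote date. A membership change added after the fact is then picked up without rebuilding the vote table. The membership record shape here (`id`, `person_id`, `party`, `start_date`, `end_date`) is hypothetical and not taken from people.json.

```python
from datetime import date

# Hypothetical memberships: one person who changed party mid-Parliament.
MEMBERSHIPS = [
    {"id": "m1", "person_id": "p1", "party": "A",
     "start_date": date(2019, 12, 12), "end_date": date(2022, 6, 30)},
    {"id": "m2", "person_id": "p1", "party": "B",
     "start_date": date(2022, 7, 1), "end_date": date(9999, 12, 31)},
]


def membership_for(person_id, vote_date, memberships):
    """Return the id of the membership covering vote_date, or None if there is none."""
    for m in memberships:
        if (
            m["person_id"] == person_id
            and m["start_date"] <= vote_date <= m["end_date"]
        ):
            return m["id"]
    return None


before_change = membership_for("p1", date(2022, 1, 5), MEMBERSHIPS)
after_change = membership_for("p1", date(2023, 3, 1), MEMBERSHIPS)
```

The trade-off is an extra date-range join on every query, in exchange for vote rows that never need rewriting.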

It’s useful if this ends up as parquet files somewhere on the internet because these can be directly referenced. This might add a few dependencies to parlparse - or we could export as XML files, which one of the GitHub-based data repos then reprocesses. My sense is the more stuff that is directly in parlparse the better.

I have in twfy-votes moved towards using ‘chamber’ as a field/column name rather than house (to better capture multiple parliaments). Could move this back a step and also do it here - but also easy to change elsewhere if want to be consistent within parlparse.

dracos commented 9 months ago

Sorry, I'm confused by this one. TWFY already has the MP/division/vote tables - is the request here not to get all the division/vote data from the existing parlparse XML directly into TWFY? Would that not be easier than basically duplicating entire datasets? May have totally misunderstood.

I can't see anything updating membership in the PW code, may have missed it, so they won't be being caught, I assume.

ajparsons commented 9 months ago

Ok, so what I actually want is the simplest way to get an up-to-date parquet file of divisions and votes that can be used for analysis queries elsewhere (and in general is cool as a bulk data API).

As TheyWorkForYou already crunches the XML for divisions, we should just use that? What this might actually be, is a database dump of a few tables from TWFY that can be reprocessed like I'm currently doing with the public whip.

So rather than a split with both feeding from ParlParse, there's a bit of looped data exchange going on between the two.

```mermaid
graph TD

HansardXML["Hansard XML"]
ParlParse

HumanAnnotator{{"Human input:  <br/>- describe divisions <br/> - create policies <br/> - assign divisions to policies"}}

PublicWhip["TheyWorkForYou Votes"]
TheyWorkForYou["TheyWorkForYou"]

HansardXML --> ParlParse
ParlParse -->|"Speeches and divisions(XML)"| TheyWorkForYou
TheyWorkForYou -->|"Voting data (parquet?)"| PublicWhip
PublicWhip -->|"Divisions and alignment(JSON)"| TheyWorkForYou
PublicWhip --> HumanAnnotator --> PublicWhip
```

ajparsons commented 6 months ago

https://github.com/mysociety/theyworkforyou/pull/1765 will close this ticket.

Closed in favour of https://github.com/mysociety/theyworkforyou/issues/1759