mysociety / parlparse

The scraper/parser that produces data for TheyWorkForYou, PublicWhip, etc

WIP: Add current Welsh Assembly Members #79

Closed samknight closed 1 year ago

samknight commented 7 years ago

I'm not really sure what you want from the generator but here is a start.

I'm not a Python dev, so I haven't quite worked out how this XML file fits into everything. Please feel free to give me a list of requirements you want me to work from.

Many thanks

dracos commented 7 years ago

I've answered in more overall detail on the mailing list, just to answer the sub-points raised here:

Hope that's helpful.

samknight commented 7 years ago

@dracos I actually already have all the data pre 2011 on Your Senedd anyway so I'll work from that for the initial list and create a parser for any updates.

samknight commented 7 years ago

@dracos It seems the historic data in parlparse is only for those who were members during the 4th and 5th terms. Anyone who was a member solely before 2011 isn't on that list, so the pre-2011 data is minimal.

Is this still a precondition for TWFY purposes? If so, could we get the historic data for terms 1, 2 and 3 into EveryPolitician instead of doing it through this? Then we would only need to worry about one source within this repo.

dracos commented 7 years ago

Ah right, I see, odd. Also looks like it's missing the region names for regional AMs, or something like that. Might not be worth using at all, then, if you already have the same data elsewhere.

parlparse is a (I think the main) source of the data for EveryPolitician for UK, Scotland, and NI, so having it in this repository would let it be consistent with those. The 4th term data in EveryPolitician is a manual CSV file, so I suppose you could generate similar manual files for the first three terms, but I can see a number of disadvantages to relying solely on EveryPolitician for this – if and when TWFY starts doing Welsh updates, we might need to add a new person quicker than it appears on EveryPolitician; the current data on EveryPolitician is a bit wrong (e.g. Mark Reckless' memberships don't have start/end dates); the memberships/people/etc will all need parlparse identifiers anyway. So I don't think there's any need to worry about more than one source, I'd rather this was the source.

Is there an issue with taking the data you said you have and generating JSON from it in the correct format? I wouldn't worry about updates, I would say any time spent on that might not be worth it given the pace of change within the membership. I would just get the data you already have, assuming it has the necessary content, into the right format, assign parlparse IDs (I think I once suggested using 70,000+), and then merge into the existing people.json dealing with the small number of people who will have been both AMs and MPs/Lords/MLAs/MSPs (probably not the latter two!) – happy to help out with any manual matching there.
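A minimal sketch of the ID-assignment step described above, for illustration only. It assumes a simplified person record with a `name` key and an ID prefix of `uk.org.publicwhip/person/`; both are assumptions here and should be checked against the real people.json. People whose names already appear in the existing data are not merged automatically but set aside for manual matching.

```python
from itertools import count

ID_PREFIX = "uk.org.publicwhip/person/"  # assumed identifier scheme, check people.json
FIRST_WELSH_ID = 70000                   # the "70,000+" range suggested above


def assign_ids(new_people, existing_persons, start=FIRST_WELSH_ID):
    """Assign unused IDs to new people; flag possible duplicates for manual review.

    Both inputs are lists of dicts with a hypothetical "name" key;
    existing_persons entries also carry an "id".
    """
    existing_names = {p["name"] for p in existing_persons}
    used_ids = {p["id"] for p in existing_persons}
    counter = count(start)
    assigned, needs_review = {}, []
    for person in new_people:
        name = person["name"]
        if name in existing_names:
            # Possibly an AM who was also an MP/Lord: match by hand, do not guess.
            needs_review.append(name)
            continue
        new_id = ID_PREFIX + str(next(counter))
        while new_id in used_ids:  # skip any ID that is already taken
            new_id = ID_PREFIX + str(next(counter))
        used_ids.add(new_id)
        assigned[name] = new_id
    return assigned, needs_review
```

Name equality is only a hint that two records might be the same person, which is why the sketch returns those names separately instead of reusing an ID.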

samknight commented 7 years ago

Yep, OK, I can manage that. The data I have was manually imported into Your Senedd in 2010, so I'm going to have to go back and double-check it all, and I will make an API endpoint from there for future updates.

tmtmtmtm commented 7 years ago

The data in EveryPolitician at the moment is obtained by scraping the Welsh Assembly site but it's not a very good scraper, in large part because the site doesn't seem particularly good for historic information. When the 5th term started, we essentially just had to archive off the previous data, and start fresh with the new one, rather than being able to continue to scrape the members of that term (or indeed get any from the earlier terms). As @dracos notes re Mark Reckless, there are also problems with it not being able to find end dates when a membership ceases or someone changes party.

It may be that all this information is available on the site through a different route, so if you know of anything like that, I'd certainly be happy to update our scrapers. Otherwise I'm also happy to wait for it all to appear in parlparse :)

EveryPolitician has just been funded to work with Wikidata on getting all this sort of information entered there too, so that might be another alternative here. I haven't checked yet to see how good the Welsh Assembly information already is in Wikidata, but if there's a good base of information to build on top of there, then the Assembly is young enough, and small enough, that that might be a plausibly quick route too, especially if we could rustle up a few other people to help out with finding and fixing edge-cases.

samknight commented 7 years ago

@tmtmtmtm @dracos I've updated the request with the structured data for a first draft. Comments appreciated

dracos commented 7 years ago

Looking good! It's not valid JSON at the moment, it appears to be concatenated JSON objects. If you look at people.json, you'll see the top-level structure is:

{
  "memberships": [...],
  "organizations": [...],
  "persons": [...],
  "posts": [...]
}

It'd be good to be consistent with that – ie. have someone's memberships in the memberships list, rather than within their person object, linked by a person_id. However, that's easy enough to switch around if it's harder to output like that!
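As a rough sketch of that switch-around, the snippet below reads a string of concatenated JSON objects and regroups them into the four top-level lists, moving each nested membership out of its person and linking it back with `person_id`. The input layout (a `memberships` list nested inside each person, each person having an `id`) is an assumption about the draft file, not something confirmed here.

```python
import json


def iter_concatenated(text):
    """Yield each value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def to_top_level(people):
    """Move nested memberships into a top-level list linked by person_id."""
    persons, memberships = [], []
    for person in people:
        person = dict(person)  # shallow copy so the input is left untouched
        for membership in person.pop("memberships", []):
            memberships.append({**membership, "person_id": person["id"]})
        persons.append(person)
    return {
        "memberships": memberships,
        "organizations": [],  # left empty in this sketch
        "persons": persons,
        "posts": [],          # left empty in this sketch
    }
```

Passing the result to `json.dumps` then gives a single valid JSON document with the same top-level keys as people.json.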

We also consider one membership per electoral period/post/party, with the party stored inside the membership using on_behalf_of_id rather than as a separate party membership entry. So, for example, Mark Reckless (who in this JSON has one UKIP membership and one Assembly membership) should have two Assembly memberships, one for his time as UKIP and one for his time as Conservative. Other AMs should likewise have a separate membership for each period of the Assembly (this should also be easy enough to split out from term data if your data isn't held like that, I imagine).
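One way to do that split, sketched under assumptions: take the Assembly terms a person sat in and their party spells, and emit one membership for every overlap of the two. The `start_date`/`end_date` key names, the `9999-12-31` placeholder for an open-ended spell, and all IDs and dates in the usage example are illustrative, not real data.

```python
def split_memberships(person_id, post_id, terms, parties):
    """Return one membership per overlap of an Assembly term and a party spell.

    terms:   list of (start, end) ISO date strings
    parties: list of (party_id, start, end) ISO date strings
    ISO dates (YYYY-MM-DD) sort correctly as plain strings, so max/min work.
    """
    memberships = []
    for term_start, term_end in terms:
        for party_id, party_start, party_end in parties:
            start = max(term_start, party_start)
            end = min(term_end, party_end)
            if start <= end:  # the two periods actually overlap
                memberships.append({
                    "person_id": person_id,
                    "post_id": post_id,
                    "on_behalf_of_id": party_id,
                    "start_date": start,
                    "end_date": end,
                })
    return memberships


# Made-up example: two terms, with a change of party partway through the second.
example = split_memberships(
    "person/1", "post/1",
    terms=[("2011-05-06", "2016-04-05"), ("2016-05-06", "2021-04-28")],
    parties=[("party-a", "2011-05-06", "2017-04-05"),
             ("party-b", "2017-04-06", "9999-12-31")],
)
```

Here `example` holds three memberships: one for the first term, and two for the second term either side of the party change.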

It'd be good for the organization_ids to match the existing ones (e.g. we have labour, this has labour-party). A few of the people lack an identifier ID; I don't know if that's an issue.
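A small sketch of how the organisation IDs might be brought into line. Only the `labour-party` to `labour` alias comes from the comment above; the `organization_id` key name, the `welsh-assembly` ID in the usage note, and the idea of passing in the set of known IDs are assumptions for illustration.

```python
ORG_ID_ALIASES = {"labour-party": "labour"}  # extend after checking people.json


def normalise_org_ids(memberships, known_ids, aliases=ORG_ID_ALIASES):
    """Rewrite aliased organisation IDs and report any that remain unknown."""
    fixed, unknown = [], set()
    for membership in memberships:
        membership = dict(membership)  # shallow copy, input left untouched
        for key in ("organization_id", "on_behalf_of_id"):
            if key in membership:
                org_id = aliases.get(membership[key], membership[key])
                membership[key] = org_id
                if org_id not in known_ids:
                    unknown.add(org_id)
        fixed.append(membership)
    return fixed, sorted(unknown)
```

Calling it with `known_ids` taken from the existing organizations list (for instance `{"labour", "welsh-assembly"}`, both hypothetical here) returns the cleaned memberships plus a sorted list of IDs that still need a decision.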

In terms of the posts, each membership has a post_id which is an identifier of an entry in the posts list, where the name of the constituency/region is in the area.name key. It looks like for MPs we treated the same name as the same area even if boundaries altered; I don't think it matters whether we do the same here or take the 2007 boundary changes as a clean slate.
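To make the post_id link concrete, here is a short sketch that resolves each membership to its constituency/region name through the posts list. The `id`, `post_id` and `area.name` keys follow the description above; the `person_id` key and the idea of failing loudly on a dangling post_id are assumptions of this sketch.

```python
def area_names(memberships, posts):
    """Return (person_id, area name) pairs by following each post_id."""
    posts_by_id = {post["id"]: post for post in posts}
    resolved = []
    for membership in memberships:
        post = posts_by_id.get(membership["post_id"])
        if post is None:
            # A membership pointing at a missing post is a data error.
            raise KeyError(f"unknown post_id {membership['post_id']!r}")
        resolved.append((membership["person_id"], post["area"]["name"]))
    return resolved
```

Running this over a draft file is a cheap way to confirm that every membership points at a post that exists before merging.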

Again, people.json should give you some examples. Hope that's helpful.

samknight commented 5 years ago

Hi, I'm looking to complete this once and for all and get as much data into TheyWorkForYou as possible. I don't really have time to host and maintain Your Senedd anymore, so since the last website change broke my scraper I have been slowly turning off the site.

I've noticed that EveryPolitician now has all the data and it's managed by morph.io. I'm assuming that changes what is needed from me quite dramatically.

I am considering creating a morph.io scraper for the XML record of proceedings in a semi structured way but not sure if that would be useful to getting this done or not.

Any suggestions for the next step forward?

dracos commented 5 years ago

Hi, good to hear from you :) Happy new year in advance!

> I've noticed that EveryPolitician now has all the data and it's managed by morph.io. I'm assuming that changes what is needed from me quite dramatically.

In one way yes, in another way no - we still need all the AMs to have identifiers that both this repository and TheyWorkForYou will understand. So I think a plan for people looks something like:

I've probably missed something, I'll have a chat with the people who made the EP Welsh stuff to check. But I think that should hopefully be it.

> I am considering creating a morph.io scraper for the XML record of proceedings in a semi structured way but not sure if that would be useful to getting this done or not.

You will know better than I when the Assembly updates its source data etc, but FWIW the existing scraper/parsers all run directly from this repo and so if a new scraper was included here it might be easier to include within the current daily run setup etc. But if you'd prefer it to be on morph, I'm sure that would be fine and we could then pull the data from there as needed (either pre- or post- translation into XML that TheyWorkForYou would understand).

[Footnote: Perhaps later could consider splitting up people.json per body and have memberships separate from people - I thought at the time it made sense to have them in one file but it does make it slow to open/parse...]

dracos commented 1 year ago

Superseded by #165, thanks for your help :)