mysociety / parlparse

The scraper/parser that produces data for TheyWorkForYou, PublicWhip, etc

How could we relate the historichansard_id IDs to historic Hansard slugs? #91

Open mhl opened 7 years ago

mhl commented 7 years ago

I'm frustrated by this, because I think one of the first things I did when working for mySociety was working with @frabcus on importing people from the historic Hansard data but I can't remember enough of the detail to be able to answer my own question!

The Wikidata project has imported all the historic MPs from the historic Hansard records from http://hansard.millbanksystems.com/ using the slugs on people pages as IDs - this is Wikidata property P2015. parlparse, however, uses IDs for historic MPs with the scheme historichansard_id which is numeric. If we could find the mapping between these two ID spaces, that would enable us to straightforwardly associate everyone in parlparse with the right Wikidata items, which would be brilliant.

The problem is that I can't find any use of the historichansard_id values on http://hansard.millbanksystems.com/ at all now. It's not in the source of people pages or debate pages on that site. The credits page links to the XML data that site is based on: http://www.hansard-archive.parliament.uk/ but those don't appear to have IDs associated with members at all - the <member> ... </member> tags have no attributes, and I can't see any other element that has them. (This is all worth double-checking, I should say!)

Can anyone help with figuring this out? Is it possible that we used a different structured data source from those XML files when importing the historic MPs into parlparse, and I'm just not finding it now? (Looking through the history of this repository, I can't even see what script might have been used for the import now, though I imagine we did commit it.)

If the historichansard_ids were the database primary keys for the Rails site hosted here: http://hansard.millbanksystems.com/ (source code here: https://code.google.com/archive/p/hansard/downloads ) then perhaps we could get a dump of that mapping from the maintainers?

To help with checking this kind of thing, an example:

n.b. some people in parlparse have the ID scheme historichansard_person_id and some have historichansard_id - I'm assuming they're the same ID space, but maybe not.

Cc: @dracos @crowbot

dracos commented 7 years ago

https://code.google.com/archive/p/hansard/downloads contains database dumps under reference_data. TWFY has its old import code for this in scripts/historic (in the TWFY repo). I can probably do better with more time, but hopefully that's enough for this. ... yep, people.json has Diane Abbott with a historichansard_person_id of 7, and commons_library_data/people.sql has her under key 7.
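
A minimal sketch of how that check could be turned into a full mapping, in Python. It assumes people.json is Popolo-style, with a `persons` list whose entries carry `identifiers` of the form `{"scheme": ..., "identifier": ...}`, and that the rows of people.sql have already been reduced to (primary key, slug) pairs. The sample data, the person ID, the slugs, and the function names are all placeholders for illustration, not values taken from either source.

```python
import json

# Assumed shape of people.json (Popolo-style); the values are placeholders.
PEOPLE_JSON = """
{"persons": [
  {"id": "example/person/a",
   "identifiers": [{"scheme": "historichansard_person_id", "identifier": "7"}]},
  {"id": "example/person/b",
   "identifiers": [{"scheme": "some_other_scheme", "identifier": "7"}]}
]}
"""

# Hypothetical (primary key, slug) pairs extracted from people.sql.
HANSARD_ROWS = [(7, "example-slug"), (8, "unmatched-slug")]


def person_ids_by_key(popolo, scheme):
    """Map each numeric identifier in `scheme` to the person's ID."""
    mapping = {}
    for person in popolo.get("persons", []):
        for ident in person.get("identifiers", []):
            if ident.get("scheme") == scheme:
                mapping[int(ident["identifier"])] = person["id"]
    return mapping


def join_to_slugs(person_by_key, rows):
    """Pair each person ID with the slug stored under the same key."""
    return {person_by_key[key]: slug for key, slug in rows if key in person_by_key}


popolo = json.loads(PEOPLE_JSON)
by_key = person_ids_by_key(popolo, "historichansard_person_id")
print(join_to_slugs(by_key, HANSARD_ROWS))
```

Rows with no matching person are skipped rather than raising, so gaps in either source show up as missing entries in the result.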

lizconlan commented 7 years ago

For reference, the Historic Hansard code is also available on GitHub https://github.com/millbanksystems/hansard and the running site is no longer a Rails app, it's (effectively) a flat file backup + a Sinatra app to replicate the original search functionality (um, https://github.com/lizconlan/hh-search-app I think, I should probably transfer that to the correct ownership)

I have a full database backup somewhere...

dracos commented 7 years ago

"n.b. some people in parlparse have the ID scheme historichansard_person_id and some have historichansard_id - I'm assuming they're the same ID space, but maybe not" - no, as with us, one is a person ID, one is a membership ID.
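
Since the two schemes are separate ID spaces, any lookup needs to keep them apart so that the same number under each scheme does not collide. A small Python sketch of that idea, using placeholder records rather than real parlparse data:

```python
from collections import defaultdict

# Illustrative (scheme, identifier, target) records; the targets are placeholders.
RECORDS = [
    ("historichansard_person_id", "7", "example person"),
    ("historichansard_id", "7", "example membership"),
]


def index_by_scheme(records):
    """Build one lookup table per scheme so equal numbers never overwrite each other."""
    index = defaultdict(dict)
    for scheme, identifier, target in records:
        index[scheme][int(identifier)] = target
    return index


index = index_by_scheme(RECORDS)
print(index["historichansard_person_id"][7])
print(index["historichansard_id"][7])
```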

mhl commented 7 years ago

Thanks, @dracos and @lizconlan - that's brilliant.