mysociety / theyworkforyou

Keeping tabs on the UK's parliaments and assemblies
http://www.theyworkforyou.com/

some links to original source for Scotland are broken #468

Open mhl opened 10 years ago

mhl commented 10 years ago

The links to the original source are broken on this page: http://www.theyworkforyou.com/sp/?id=2014-03-04.7.0

dracos commented 10 years ago

There are three scrapedxml entries for that day's debate on the site: 8984.xml, 8985.xml and 8997.xml. I'm not sure how multiple XML entries for one day are meant to be processed. The correction from 8997 is present, but perhaps 8984 was loaded on top of it? 8997.xml is the version on the official site, and changing the source link from 8984 to 8997 makes the link work.

8984 => 8985 corrects a speaker's name; 8985 => 8997 has a full correction.

dracos commented 10 years ago

Another example this week: http://www.theyworkforyou.com/sp/?id=2014-08-19.12.0 has 9493 and 9497, with 9497 containing the small correction but not being loaded into the site.
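In both examples so far, the highest-numbered scraped file is the one carrying the latest correction. A minimal Python sketch of that observation, assuming (this is not confirmed anywhere in the thread as a general rule) that a higher numeric ID always means a later revision; `latest_scraped_id` is a hypothetical helper, not something in the codebase:

```python
def latest_scraped_id(xml_ids):
    """Return the numerically highest scraped XML ID.

    Assumption: a higher ID is a later revision of the same day's debate.
    That holds for the two cases in this issue but is otherwise unverified.
    """
    # Compare as integers so that e.g. "10000" sorts after "9999".
    return max(xml_ids, key=int)


march_debate = latest_scraped_id(["8984", "8985", "8997"])
august_debate = latest_scraped_id(["9493", "9497"])
```

Here `march_debate` and `august_debate` pick out 8997 and 9497, the files that the official site is serving in each case.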

dracos commented 10 years ago

From a surface reading, db_addpair in xml2db.pl is meant to spot when the same GID is used twice while parsing a day's worth of XML, so I wondered why it wasn't erroring here. However, db_addpair sets $ignorehistorygids{$gid} = 1, and all the functions that call db_addpair just move on if that is set. So duplicate GIDs are always ignored rather than raising an error, the error handling can never be reached, and the first GID found is used (which is why the later XML IDs aren't being imported). ignorehistorygids appears to be for redirects of old GIDs, so I can't quite see why db_addpair sets it too, but presumably there was a reason.
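To make the control flow easier to follow, here is a simplified Python model of it. It is not the actual Perl: `db_addpair` and `ignorehistorygids` are named after the real ones, but their bodies, the `load_speech` caller, the `loaded` dict, the GID string and the ascending load order are all assumptions for illustration.

```python
# Simplified model of the behaviour described above; the real code is Perl
# in xml2db.pl and will differ in detail.

ignorehistorygids = {}  # GIDs that callers skip
loaded = {}             # gid -> scraped XML ID that was stored (hypothetical)


def db_addpair(gid, source):
    """Store a GID, erroring on a duplicate as a surface reading suggests."""
    if gid in loaded:
        # Intended error path. Never reached through load_speech, because
        # the caller has already skipped any GID marked below.
        raise ValueError(f"duplicate GID {gid}")
    loaded[gid] = source
    ignorehistorygids[gid] = 1  # this is what hides later duplicates


def load_speech(gid, source):
    """Hypothetical caller: moves on if the GID is already marked."""
    if ignorehistorygids.get(gid):
        return False  # silently skipped, no error raised
    db_addpair(gid, source)
    return True


# Illustrative GID; assumes the three files are parsed in ascending order.
gid = "2014-03-04.7.0"
results = [load_speech(gid, src) for src in ("8984", "8985", "8997")]
```

In this model only the first call stores anything, so `loaded[gid]` stays at 8984 and the corrected 8985 and 8997 are dropped without an error, which matches what the site shows.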