swt83 closed this issue 11 years ago
Hi. I'm not seeing that. Can you provide the command line args you're using and also an example filename? Thanks!
So out of many examples, one is data/108/votes/2004/h405/data.json. If I do a search of the document, one of the legislator ids will be "0000000". I scraped it last night using ./run votes --congress=108 --session=2004 --force.
Yeah, I see it. Run ./run votes --vote_id=h405-108.2004 and then look at data/108/votes/2004/h405. One of the voters is:
{
  "display_name": "Butterfield",
  "id": "0000000",
  "party": "D",
  "state": "NC"
}
The 0's appear in the original data: http://clerk.house.gov/evs/2004/roll405.xml
Something to report to the Clerk, I think.
Ahha. I hadn't scraped that far back. On GovTrack I used to fall back to name/state (so, fwiw, the data is complete there: http://www.govtrack.us/data/us/108/rolls/h2004-405.xml).
Okay. I'll report it to the Clerk.
I committed a check for 0000000. Maybe we want to make it an error condition?
OK, I've made 0000000 an error condition. I also made improperly parsing the legis-num an error condition, and added "MOTION" to the list of acceptable values it can have.
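For anyone following along, a check like that could look roughly like the sketch below. This is not the code that was committed; `check_voter_ids` and `InvalidVoterId` are hypothetical names, and the voter dicts are assumed to have the same shape as the JSON record quoted earlier in this thread.

```python
# Hypothetical sketch, not the scraper's actual code.
PLACEHOLDER_ID = "0000000"


class InvalidVoterId(ValueError):
    """Raised when a voter record has a missing or all-zero id."""


def check_voter_ids(voters):
    """Return the voters unchanged, or raise if any id is unusable."""
    for voter in voters:
        if voter.get("id") in (None, "", PLACEHOLDER_ID):
            raise InvalidVoterId(
                "Invalid legislator id for %s (%s-%s)"
                % (voter.get("display_name"), voter.get("party"), voter.get("state"))
            )
    return voters
```

Raising instead of logging means a bad vote file fails loudly rather than getting written out with an id that matches nobody.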
(The ticket can stay open 'til the data's fixed.)
This hasn't been fixed yet, and I've confirmed (myself, and with the Clerk) that it only affects this one person, and only a specific time frame: vote No. 405 through No. 544. (The first vote he took following his special election, until the end of that Congress.)
Given that, should I simply hardcode a fix in the scraper for that value for that time?
I vote yes.
Yeah, I think it can only reduce the amount of error in the scraper's output, even in the long run. I'll do this.
But if they make the same error w/ a different member, then we won't be able to catch it.
Well, we'd catch it the same way we caught this. And right now, this is causing a big swathe of invalid data. It seems unlikely to happen for another member, especially since we now know the cause - that the guy was specially elected mid-session. So as long as we only do it for House votes between these two numbers in that year, the only way it'll fail us is if it develops for someone else during that specific time period. So the worst case is we'll be in the same situation we're in right now, and the best (and most likely) case is it's all fixed.
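To make the scoping concrete, here is a minimal sketch of that kind of narrowly bounded patch. The function name and arguments are hypothetical, the replacement id is passed in rather than hardcoded here, and the extra name/state guard is my own assumption, added so the patch cannot touch anyone else in that window.

```python
# Hypothetical sketch of a hardcoded fix limited to House votes 405-544 in 2004.
PLACEHOLDER_ID = "0000000"


def patch_placeholder_id(voter, chamber, year, roll_number, replacement_id):
    """Return a copy of voter with the placeholder id replaced, only inside the known window."""
    in_window = chamber == "h" and year == 2004 and 405 <= roll_number <= 544
    is_known_case = (
        voter.get("id") == PLACEHOLDER_ID
        and voter.get("display_name") == "Butterfield"  # assumed extra guard
        and voter.get("state") == "NC"
    )
    if in_window and is_known_case:
        return dict(voter, id=replacement_id)
    return voter
```

Anything outside the window, or any other member with a zeroed id inside it, passes through untouched, so the error check still fires for new cases.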
Did they give any indication that they would fix the issue upstream?
Yes, but only "at some point".
This still isn't fixed upstream, btw.
I've replaced the previous fix with a more generic name lookup in 08f4025faba4d25f4ab7c19beab3ca4595756a5d. (Through the 107th Congress there were no bioguide IDs listed for anyone!)
In the vote data for the 110th Congress there are many legislator ids that say "0000000", which is invalid. I guess it's a problem w/ the source data and nothing we can do about it. A workaround might be to compare the date, lastname, state, and chamber to the Legislators table to try and fill in the missing ID value.
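A rough sketch of that workaround, assuming the Legislators table can be loaded as a list of dicts with `last_name`, `state`, `chamber`, `start`, `end` and `id` keys (those field names are my assumption, not an existing schema):

```python
from datetime import date


def lookup_legislator_id(legislators, last_name, state, chamber, vote_date):
    """Return the id of the single legislator serving on vote_date who matches, else None."""
    matches = [
        leg["id"]
        for leg in legislators
        if leg["last_name"].lower() == last_name.lower()
        and leg["state"] == state
        and leg["chamber"] == chamber
        and leg["start"] <= vote_date <= leg["end"]
    ]
    # Refuse to guess when the match is ambiguous or missing.
    return matches[0] if len(matches) == 1 else None
```

Returning None on zero or multiple matches leaves the record flagged as invalid instead of silently attaching the wrong person, which matters for members who share a last name and state.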