unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
https://github.com/unitedstates/congress/wiki
Creative Commons Zero v1.0 Universal

Historical: Legislator IDs are 0000000 in some historical House votes #46

Closed swt83 closed 11 years ago

swt83 commented 11 years ago

In the vote data for the 110th Congress there are many legislator ids that say "0000000", which is invalid. I guess it's a problem w/ the source data and nothing we can do about it. A workaround might be to compare the date, lastname, state, and chamber to the Legislators table to try and fill in the missing ID value.

JoshData commented 11 years ago

Hi. I'm not seeing that. Can you provide the command line args you're using and also an example filename? Thanks!

swt83 commented 11 years ago

So out of many examples, one is data/108/votes/2004/h405/data.json. If I do a search of the document, one of the legislator ids will be "0000000". I scraped it last night using ./run votes --congress=108 --session=2004 --force.

konklone commented 11 years ago

Yeah, I see it. Run ./run votes --vote_id=h405-108.2004 and then look at data/108/votes/2004/h405. One of the voters is:

{
  "display_name": "Butterfield", 
  "id": "0000000", 
  "party": "D", 
  "state": "NC"
}

The 0's appear in the original data: http://clerk.house.gov/evs/2004/roll405.xml

Something to report to the Clerk, I think.
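
For anyone who wants to check their own scraped output for the same problem, here is a minimal sketch. It assumes that each `data.json` holds a top-level `"votes"` object mapping a vote option to a list of voter records shaped like the one quoted above. The helper name `find_placeholder_voters` and the `"Yea"` key in the sample are illustrative only and do not reflect the real tally for this roll call.

```python
PLACEHOLDER_ID = "0000000"

def find_placeholder_voters(vote):
    """Return (option, voter) pairs whose id is the all-zero placeholder."""
    hits = []
    for option, voters in vote.get("votes", {}).items():
        for voter in voters:
            # Skip anything that is not a voter record (assumed to be a dict).
            if isinstance(voter, dict) and voter.get("id") == PLACEHOLDER_ID:
                hits.append((option, voter))
    return hits

# Illustrative sample; only the Butterfield record comes from the thread.
sample = {
    "votes": {
        "Yea": [
            {"display_name": "Butterfield", "id": PLACEHOLDER_ID,
             "party": "D", "state": "NC"},
            {"display_name": "Example", "id": "X000001",
             "party": "R", "state": "TX"},
        ]
    }
}

hits = find_placeholder_voters(sample)
```

To run it against a real file, load it with `json.load(open("data/108/votes/2004/h405/data.json"))` and pass the result to the function.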

JoshData commented 11 years ago

Ahha. I hadn't scraped that far back. On GovTrack I used to fall back to name/state (so, fwiw, the data is complete there: http://www.govtrack.us/data/us/108/rolls/h2004-405.xml).

konklone commented 11 years ago

Okay. I'll report it to the Clerk.

JoshData commented 11 years ago

I committed a check for 0000000. Maybe we want to make it an error condition?

konklone commented 11 years ago

OK, I've made 0000000 an error condition. I also made improperly parsing the legis-num an error condition, and added "MOTION" to the list of acceptable values it can have.

(The ticket can stay open 'til the data's fixed.)
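
A rough sketch of what treating the placeholder as an error condition could look like. This is not the code that was committed; the exception class and the function name are hypothetical, and the voter record is assumed to be a dict with an `"id"` key as in the example earlier in the thread.

```python
class InvalidLegislatorId(ValueError):
    """Raised when a voter record carries the all-zero placeholder id."""

def check_voter_id(voter, vote_id):
    # Fail loudly instead of writing a record that cannot be joined
    # to any legislator.
    if voter.get("id") == "0000000":
        raise InvalidLegislatorId(
            "%s: placeholder id for %s (%s)"
            % (vote_id, voter.get("display_name"), voter.get("state"))
        )
    return voter

bad = {"display_name": "Butterfield", "id": "0000000",
       "party": "D", "state": "NC"}

try:
    check_voter_id(bad, "h405-108.2004")
    raised = False
except InvalidLegislatorId as exc:
    raised = True
    message = str(exc)
```

Raising rather than logging means a whole vote fails to save until the ID is resolved, which is the trade-off being accepted here.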

konklone commented 11 years ago

This hasn't been fixed yet, and I've confirmed (myself, and with the Clerk) that it only affects this one person, and only a specific time frame: vote No. 405 through No. 544. (The first vote he took following his special election, until the end of that Congress.)

Given that, should I simply hardcode a fix in the scraper for that value for that time?

dwillis commented 11 years ago

I vote yes.

konklone commented 11 years ago

Yeah, I think it can only reduce the amount of error in the scraper's output, even in the long run. I'll do this.

swt83 commented 11 years ago

But if they make the same error w/ a different member, then we won't be able to catch it.

konklone commented 11 years ago

Well, we'd catch it the same way we caught this. And right now, this is causing a big swathe of invalid data. It seems unlikely to happen for another member, especially since we now know the cause - that the guy was specially elected mid-session. So as long as we only do it for House votes between these two numbers in that year, the only way it'll fail us is if it develops for someone else during that specific time period. So the worst case is we'll be in the same situation we're in right now, and the best (and most likely) case is it's all fixed.
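
The narrowly scoped fix argued for above could be sketched roughly like this. It only touches a record when every condition holds: a House vote, in 2004, numbered 405 through 544, with the placeholder ID and the matching name and state. The function name is hypothetical, and the correct ID is passed in as `replacement_id` rather than hardcoded here, since the thread does not state it.

```python
FIRST_AFFECTED = 405
LAST_AFFECTED = 544

def patch_known_placeholder(voter, chamber, year, number, replacement_id):
    """Return a corrected copy of voter if it is the one known bad case."""
    in_window = (
        chamber == "h"
        and year == 2004
        and FIRST_AFFECTED <= number <= LAST_AFFECTED
    )
    is_target = (
        voter.get("id") == "0000000"
        and voter.get("display_name") == "Butterfield"
        and voter.get("state") == "NC"
    )
    if in_window and is_target:
        # Copy rather than mutate, so the caller keeps the raw record.
        return dict(voter, id=replacement_id)
    return voter

raw = {"display_name": "Butterfield", "id": "0000000",
       "party": "D", "state": "NC"}

# "X000000" is a stand-in value, not a real bioguide ID.
fixed = patch_known_placeholder(raw, "h", 2004, 405, "X000000")
untouched = patch_known_placeholder(raw, "h", 2004, 404, "X000000")
```

Any other member showing up with the placeholder falls through unchanged, so a check for `0000000` would still catch it.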

GPHemsley commented 11 years ago

Did they give any indication that they would fix the issue upstream?

konklone commented 11 years ago

Yes, but only "at some point".

JoshData commented 11 years ago

This still isn't fixed upstream, btw.

I've replaced the previous fix with a more generic name lookup in 08f4025faba4d25f4ab7c19beab3ca4595756a5d. (Through the 107th Congress there were no bioguide IDs listed for anyone!)
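
For readers curious what a name-based fallback might look like, here is a minimal sketch. It is not the code in that commit. The legislator records, their field names (`last_name`, `state`, `id`) and the IDs in the sample are all assumptions for illustration. The lookup only returns an ID when exactly one legislator matches, so ambiguous names stay unresolved rather than being guessed.

```python
from collections import defaultdict

MISSING_IDS = (None, "", "0000000")

def build_index(legislators):
    """Index legislator ids by (lowercased last name, state)."""
    index = defaultdict(list)
    for leg in legislators:
        index[(leg["last_name"].lower(), leg["state"])].append(leg["id"])
    return index

def resolve_id(voter, index):
    """Return the voter's id, falling back to a name/state lookup."""
    if voter.get("id") not in MISSING_IDS:
        return voter["id"]
    key = (voter["display_name"].lower(), voter["state"])
    matches = index.get(key, [])
    # Refuse to guess when zero or several legislators match.
    return matches[0] if len(matches) == 1 else None

# Hypothetical legislator records with stand-in ids.
legislators = [
    {"id": "X000001", "last_name": "Butterfield", "state": "NC"},
    {"id": "X000002", "last_name": "Smith", "state": "TX"},
    {"id": "X000003", "last_name": "Smith", "state": "TX"},
]
index = build_index(legislators)

resolved = resolve_id(
    {"display_name": "Butterfield", "id": "0000000", "state": "NC"}, index)
ambiguous = resolve_id(
    {"display_name": "Smith", "id": "0000000", "state": "TX"}, index)
```

A real lookup would also need to restrict candidates to members serving in that chamber on the vote date, and to handle display names that carry extra disambiguating text.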