unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
https://github.com/unitedstates/congress/wiki
Creative Commons Zero v1.0 Universal

New Data: Nominations #32

Closed: konklone closed this issue 11 years ago

konklone commented 11 years ago

Not that it's of any particular urgency, but filing this ticket to reflect a conversation over at unitedstates/wish-list#7 by @wilson428 and @dwillis about getting nominations from THOMAS.

wilson428 commented 11 years ago

I can take a stab here unless @dwillis wants to go first. My original python scripts submitted POST requests by congress and crawled the HTML comments on the results pages for the nominations, which contain the most structured information. Do you have a recommendation for where in this project to start? Can I piggyback on any existing tasks?
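Extracting structured data from HTML comments, as described above, can be done with a simple regex pass. This is an illustrative sketch, not the original scripts; the comment content shown is made up:

```python
import re

def html_comments(page_html):
    # Pull the contents of every HTML comment out of a results page;
    # the structured nomination data is embedded in comments like these.
    return re.findall(r"<!--(.*?)-->", page_html, re.DOTALL)

# e.g. html_comments("<p>results</p><!--PN67-->") yields ["PN67"]
```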

dwillis commented 11 years ago

Have at it; I can take a look in a few days, I bet. Start with the bill scraper for reference.

konklone commented 11 years ago

A quick summary of how the bill stuff works: the bill scraper's divided into two parts: bills.py and bill_info.py. bills.py takes care of paginating through lists (figuring out which IDs to go fetch details for), and then makes repeated calls to bill_info.py for details on individual bills by the IDs it identified.

bills.py makes use of a little processing function we put into utils, utils.process_set, which takes a set of IDs (bill IDs) and a function to call for each one (bill_info.fetch_bill). It expects each call to that function to return a small dict with a couple of keys (like 'ok'), and then produces a report of how many it processed when it's done.

bills.py isn't called directly; it offers a run method that accepts an options dict, and the "run" script calls that method on whichever module is named. So you ./run bills to call bills.py's run method, with the options dict built from the command-line flags.
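To make the shape concrete, here's a hypothetical sketch of that pattern. The names mirror the comment (process_set, fetch_bill, the 'ok' key), but the real project's signatures may differ, and the stub fetch function stands in for actual scraping:

```python
def process_set(ids, fetch_func, options):
    # Call fetch_func for each ID; each call should return a dict
    # with at least an 'ok' key indicating success.
    saved, errors = [], []
    for item_id in ids:
        result = fetch_func(item_id, options)
        if result.get("ok"):
            saved.append(item_id)
        else:
            errors.append(item_id)
    # produce a report of how many were processed when done
    print("Processed %d: %d saved, %d errors" % (len(ids), len(saved), len(errors)))
    return saved, errors

def fetch_bill(bill_id, options):
    # stand-in for bill_info.fetch_bill: fetch and parse one bill's details
    return {"ok": True}

def run(options):
    # the "./run bills" script would call this with options parsed from flags
    bill_ids = ["hr1-113", "s47-113"]  # in reality, paginated from list pages
    return process_set(bill_ids, fetch_bill, options)
```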

So, my advice - you could start off by doing a nomination_info.py with its own run method that takes an ID (e.g. "pn67-113" for PN67 of the 113th Congress), and write that script to go fetch details for the given nomination. Once you feel good about how that works, make a nominations.py that probably is mostly a copy of bills.py with some things changed, which uses nomination_info as its workhorse for each nomination it discovers.
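A rough sketch of what that nomination_info.py entry point might look like. The ID format ("pn67-113" = PN67 of the 113th Congress) comes from the comment above; the helper names and parsing here are illustrative, not the project's actual code:

```python
import re

def split_nomination_id(nomination_id):
    # "pn67-113" -> ("67", "113"); returns (None, None) on bad input
    match = re.match(r"^pn(\d+)-(\d+)$", nomination_id)
    if not match:
        return None, None
    return match.group(1), match.group(2)

def run(options):
    nomination_id = options.get("nomination_id")
    number, congress = split_nomination_id(nomination_id)
    if number is None:
        return {"ok": False, "reason": "bad nomination ID"}
    # ...fetch and parse the THOMAS page for this nomination here...
    return {"ok": True, "number": number, "congress": congress}
```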

The nominations pages don't look like they have a whole lot of metadata, so this is a great place to contribute, I think. Happy to help with anything as it comes up.

wilson428 commented 11 years ago

Thanks for the rundown, that's really helpful. Just committed a first go at parsing the nomination pages. More soon.

wilson428 commented 11 years ago

I've got nominations.py working and fetching nominations. It still needs work catching joint nominations and split nominations, so I don't think I should add it to the README yet, but it's testable here.

Currently it only fetches civilian nominations. Military nominations appear much more rote, and much more of a pain, since you'll get 800-person nominations in one swoop, so I opted not to include them.

 ./run nominations --congress=109 --limit=10
JoshData commented 11 years ago

This is a great start!

konklone commented 11 years ago

This is a super great start. I'm traveling this weekend and can't give it much real testing time, but just looking over the commits, a couple thoughts -

I'm so happy this is happening - I've been wanting to use this data for a while.

I wonder how matchable-up this is with the Plum Book?

wilson428 commented 11 years ago

Thank you! I definitely want to match with the Plum Book so that one can filter by pay grade. I'm also going to add an "is_cabinet" variable and so forth in order to pare the thousands of nominees down to the most important ones. I need to do some stuff for my day job today but can probably work in the --nomination_id flag and a few other things.

wilson428 commented 11 years ago

Made a little progress with converting text from nomination pages into fielded data, plus some error catching as Eric suggested. Still more to do, like fielding the result of the nomination and calculating the number of days elapsed between nomination and conclusion.
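The "days elapsed" calculation mentioned above is simple once the dates are fielded. A minimal sketch, assuming the dates end up as "YYYY-MM-DD" strings (an assumption about the output format):

```python
from datetime import date

def days_elapsed(nominated_on, concluded_on):
    # dates as "YYYY-MM-DD" strings; returns whole days between them
    start = date(*[int(p) for p in nominated_on.split("-")])
    end = date(*[int(p) for p in concluded_on.split("-")])
    return (end - start).days

# e.g. days_elapsed("2013-01-03", "2013-03-04") gives 60
```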

konklone commented 11 years ago

Awesome! And thinking more on the Plum Book thing - you can probably take a similar approach to what we did with legislator/committee info. The project always clones the newest version of unitedstates/congress-legislators in order to do the ID crosswalking that the govtrack XML output requires. You could do the same with the Plum Book data from its own repo, if you wanted to have that data available to you in the nominations script.

konklone commented 11 years ago

I just put this scraper through a ton of work. It now produces reliable data on single nominations and batch military nominations, normalizes committee names, and is set up to choke on any unexpected data. I've tested it from the 111th Congress onwards, and am in the process of downloading earlier Congresses' nomination data to fix any bugs on older stuff (nomination pages aren't yet in the cache directories I have).

I'll update the README to include it as one of the major things you can get with this project.

@wilson428, thank you so much for starting this -- I am so glad I did not have to solve the awful parsing problems, HTML comment extraction, URL construction, and session mgmt + POSTing stuff. THOMAS' pages for nominations are way worse than for bills, but I feel good about this data now.

wilson428 commented 11 years ago

Accidentally commented before with the wrong Github identity, but happy to do it. Looks like a huge improvement you made. Thx!

konklone commented 10 years ago

For anyone using the nominations scraper, you'll want to do an update - I just patched it to drop the automatic caching of the search results page, and I fixed up the POST request it makes to include the same max range that THOMAS uses (5000) and to include military nominations.
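For illustration, here's roughly what building that search payload could look like. The field names below are invented for the sketch; only the 5000 max-results value and the military-inclusion behavior come from the comment above:

```python
from urllib.parse import urlencode

def build_search_payload(congress, max_results=5000, include_military=True):
    # "maxresults" and "civilian_only" are hypothetical field names;
    # 5000 mirrors the max range THOMAS itself uses.
    payload = {"congress": congress, "maxresults": max_results}
    if not include_military:
        payload["civilian_only"] = "1"
    return urlencode(payload)
```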