konklone closed this issue 11 years ago
I can take a stab here unless @dwillis wants to go first. My original python scripts submitted POST requests by congress and crawled the HTML comments on the results pages for the nominations, which contain the most structured information. Do you have a recommendation for where in this project to start? Can I piggyback on any existing tasks?
Have at it; I can take a look in a few days, I bet. Start with the bill scraper for reference.
A quick summary of how the bill stuff works. The bill scraper is divided into two parts: bills.py and bill_info.py. bills.py takes care of paginating through lists (figuring out which IDs to go fetch details for), and then makes repeated calls to bill_info.py for details on individual bills, by the IDs it identified.
bills.py makes use of a little processing function we put into utils, utils.process_set, which takes a set of IDs (bill IDs) and a function to call for each one (bill_info.fetch_bill). It expects each call to that function to return a small dict with a couple of keys (like 'ok'), and then produces a report when it's done of how many it processed.
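The contract described above can be sketched roughly like this. This is an illustrative reconstruction of the pattern, not the actual utils.process_set implementation; the fetch_bill stub and report shape are assumptions.

```python
# Sketch of the process_set pattern: iterate over IDs, call a fetch
# function for each, check the returned dict for an 'ok' key, and
# produce a summary report at the end. Names are illustrative.
def process_set(ids, fetch_func, options=None):
    """Call fetch_func for each ID and tally successes and errors."""
    saved, errors = [], []
    for item_id in ids:
        result = fetch_func(item_id, options or {})
        if result.get("ok"):
            saved.append(item_id)
        else:
            errors.append((item_id, result))
    print("Processed %d items: %d ok, %d errors"
          % (len(ids), len(saved), len(errors)))
    return {"saved": saved, "errors": errors}

def fetch_bill(bill_id, options):
    # A real implementation would download and parse the bill's page;
    # this stub just reports success.
    return {"ok": True}

report = process_set(["hr1-113", "s1-113"], fetch_bill)
```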
bills.py isn't called directly; by offering a run method that accepts an options dict, the "run" script calls it with the name given. So you ./run bills to call bills.py's run method, where the options dict is transformed from the command-line flags.
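That dispatch step might look something like the following. This is a hedged sketch of the mechanism described, not the project's actual runner; the function names parse_options and dispatch are assumptions.

```python
import importlib

def parse_options(argv):
    """Turn command-line flags like --congress=109 into an options
    dict like {'congress': '109'}; bare flags become True."""
    options = {}
    for arg in argv:
        if arg.startswith("--"):
            key, _, value = arg[2:].partition("=")
            options[key] = value if value else True
    return options

def dispatch(task_name, argv):
    """Import the named task module (e.g. 'bills') and call its
    run(options) method, as the ./run script is described to do."""
    task = importlib.import_module(task_name)
    return task.run(parse_options(argv))
```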
So, my advice - you could start off by doing a nomination_info.py with its own run method that takes an ID (e.g. "pn67-113" for PN67 of the 113th Congress), and write that script to go fetch details for the given nomination. Once you feel good about how that works, make a nominations.py that probably is mostly a copy of bills.py with some things changed, which uses nomination_info as its workhorse for each nomination it discovers.
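A first step for that nomination_info.py might be splitting the nomination ID into its parts. The ID format ("pn67-113" for PN67 of the 113th Congress) is from the thread; the helper's name and return shape are hypothetical.

```python
import re

def parse_nomination_id(nomination_id):
    """Split an ID like 'pn67-113' into its nomination number and
    Congress number, e.g. {'number': '67', 'congress': '113'}."""
    match = re.match(r"^pn(\d+)-(\d+)$", nomination_id)
    if not match:
        raise ValueError("Not a valid nomination ID: %s" % nomination_id)
    number, congress = match.groups()
    return {"number": number, "congress": congress}
```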
The nominations pages don't look like they have a whole lot of metadata, so this is a great place to contribute, I think. Happy to help with anything as it comes up.
Thanks for the rundown, that's really helpful. Just committed a first go at parsing the nomination pages. More soon.
I've got nominations.py working and fetching nominations. Needs work catching joint nominations and split nominations, so I don't think I should add to README yet, but testable here.
Currently it only fetches civilian nominations. Military nominations appear much more rote, and much more of a pain, since you'll get 800 people nominated in one swoop, so I've chosen not to include them for now.
./run nominations --congress=109 --limit=10
This is a great start!
This is a super great start. I'm traveling this weekend and can't give it much real testing time, but just looking over the commits, a couple thoughts -
- utils.process_set, once you use it, will auto-catch exceptions, note them at the end, and email them if you've got credentials in config.yml. If you pass --raise, it'll let exceptions crash the script.
- The fetch_nomination calls at the bottom could be rendered moot by using the --nomination_id flag? Its only purpose is to aid in development that way and reduce the chances of random test lines getting committed uncommented.

I'm so happy this is happening - I've been wanting to use this data for a while.
I wonder how well this could be matched up with the Plum Book?
Thank you! I definitely want to match with the Plum Book so that one can filter by pay grade. I'm also going to add an "is_cabinet" variable and so forth, in order to pare down the thousands of nominees to the most important ones. I need to do some stuff for my day job today, but can probably work in the --nomination_id flag and a few other things.
Made a little progress with converting text from nomination pages into fielded data, plus some error catching as Eric suggested. Still more to do, like fielding the result of the nomination and calculating the number of days elapsed between nomination and conclusion.
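The "days elapsed" field mentioned above could be computed along these lines, assuming the scraper has already parsed the nomination and conclusion dates into ISO strings; this is a sketch, not the scraper's actual code.

```python
from datetime import date

def days_elapsed(nominated_on, concluded_on):
    """Days between nomination and conclusion, given ISO date
    strings like '2013-01-03' (assumed field format)."""
    start = date.fromisoformat(nominated_on)
    end = date.fromisoformat(concluded_on)
    return (end - start).days
```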
Awesome! And thinking more on the Plum Book thing - you can probably take a similar approach to the one we took with legislator/committee info. The project will just always clone the newest version of unitedstates/congress-legislators in order to do the ID crosswalking that the GovTrack XML output requires. You could do the same with the Plum Book data from its own repo, if you wanted to have that data available to you in the nominations script.
I just put this scraper through a ton of work, and it now produces reliable data on single nominations and batch military nominations, normalizes committee names, and is set up to choke on any unexpected data. I've tested it from the 111th Congress onwards, and am in the process of downloading earlier Congresses' nomination data to fix any bugs in older records (those nomination pages aren't yet in the cache directories I have).
I'll update the README to include it as one of the major things you can get with this project.
@wilson428, thank you so much for starting this -- I am so glad I did not have to solve the awful parsing problems, HTML comment extraction, URL construction, and session mgmt + POSTing stuff. THOMAS' pages for nominations are way worse than for bills, but I feel good about this data now.
Accidentally commented before with the wrong GitHub identity, but happy to do it. Looks like a huge improvement you made. Thx!
For anyone using the nominations scraper, you'll want to do an update - I just patched it to drop the automatic caching of the search results page, and I fixed up the POST request it makes to include the same max range that THOMAS uses (5000) and to include military nominations.
Not that it's of any particular urgency, but filing this ticket to reflect a conversation over at unitedstates/wish-list#7 by @wilson428 and @dwillis about getting nominations from THOMAS.