Open crdunwel opened 8 years ago
@dwillis points out that there's XML at http://www.senate.gov/general/XML.htm, so we should transition the scraper to use that instead.
[see #177 --- comment deleted by @JoshData]
Hi all, I may take a stab at this. A few thoughts before I dive in:
The XML referenced above seems to show only nominations for the current term. Is there a preferred strategy for handling nominations that predate whatever the current term is?
Unfortunately I don't see a great source data format for nominations. Even those XML files would require parsing names of nominees (or getting that information from another source based on the nomination number PNXXX).
I'm considering having the nominations scraper use Congress.gov's search. This example query could work for finding the nominations with the most recent changes, and the details could be drawn from the nomination action pages. Does that strategy make sense?
Any other comments before I begin?
@kevinschaul I think it's worth having the XML parser mainly because in theory it would be less brittle than scraping Congress.gov, but I'm not going to die on that hill if others don't feel strongly about it. In either case I think we'd need to do a little bit of parsing to grab the nominee name.
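To make the "little bit of parsing" concrete, here is a minimal sketch of pulling a nominee name out of a description string. It assumes a civilian description shaped like "<name>, of <state>, to be <position>"; that shape, the `DESCRIPTION_RE` pattern, and the `parse_description` helper are all illustrative assumptions, not something either data source documents.

```python
import re

# ASSUMPTION: a single-nominee description looks like
# "<name>, of <state>, to be <position>". This shape is illustrative only.
DESCRIPTION_RE = re.compile(
    r"^(?P<name>.+?), of (?P<state>.+?), to be (?P<position>.+?)\.?$"
)


def parse_description(text):
    """Return a dict with name, state and position, or None if the text
    does not follow the assumed single-nominee shape."""
    match = DESCRIPTION_RE.match(text.strip())
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    # Made-up example input, not real nomination data.
    print(parse_description("Jane Doe, of Maryland, to be Ambassador to Freedonia."))
```

Returning `None` instead of raising lets the caller fall back to another strategy for descriptions that do not fit, such as list-style nominations covering several people.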
It would be nice to have the full action history (only on Congress.gov?), but either data source is fine with me.
Yeah, I think the full action history is better on Congress.gov. So let's go with a conventional scraper of that.
hi @dwillis @JoshData, I'd be interested in tackling a congress.gov scraper if it's still needed. is the task to grab the full action history for each nomination?
Hey @AlJohri, thanks! That would be great. We'd want each nomination and all of its actions. There are a couple of wrinkles here: some nominations (especially military ones) represent multiple nominees, while most civilian nominations are for a single person. Others are divided into multiple records for the same PN number.
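One way to handle both wrinkles is to keep nominees as a list on every record and to group divided records under their shared PN number. The sketch below assumes identifiers look like "PN123" or "PN123-2" (a numeric suffix for each divided part) and that each record is a dict with an `id` key; the identifier format, the record layout, and the `parse_pn` and `group_by_pn` helpers are hypothetical.

```python
import re
from collections import defaultdict

# ASSUMPTION: identifiers look like "PN123" or, for a divided nomination,
# "PN123-2". The real identifier format may differ.
PN_RE = re.compile(r"^PN(?P<number>\d+)(?:-(?P<part>\d+))?$")


def parse_pn(identifier):
    """Split an identifier into (number, part); part is None when absent."""
    match = PN_RE.match(identifier.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized nomination identifier: {identifier!r}")
    part = match.group("part")
    return int(match.group("number")), (int(part) if part is not None else None)


def group_by_pn(records):
    """Group record dicts by PN number, annotating each with its part."""
    grouped = defaultdict(list)
    for record in records:
        number, part = parse_pn(record["id"])
        grouped[number].append({**record, "part": part})
    return dict(grouped)


if __name__ == "__main__":
    # Made-up records: one divided nomination and one single-person nomination.
    sample = [
        {"id": "PN100-1", "nominees": ["A"]},
        {"id": "PN100-2", "nominees": ["B", "C"]},
        {"id": "PN7", "nominees": ["D"]},
    ]
    print(group_by_pn(sample))
```

Storing `nominees` as a list even for single-person nominations means civilian and military records share one shape downstream.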
sounds good! I made some initial progress for parsing a single nomination. This is what the output looks like: https://gist.github.com/AlJohri/80b5c4a55ddfc04eb2a24a413a0b79cd#file-sample-json
I'll look into a couple of those edge cases you mentioned next
This looks good so far, thanks!
Has any progress been made on this scraper? Where is the current version? I'm just looking for a reasonably usable one to test with.
no sorry @lwaltman, but it should be pretty easy to modify https://gist.github.com/AlJohri/80b5c4a55ddfc04eb2a24a413a0b79cd#file-nominations-py
Any updates in 2020? I may be willing to work on this, especially if there already is some infrastructure in place
Also, I am more interested in bill data from 1970-2012, which was a part of the thomas scraper
@Darokrithia We've got from 1973-on here: https://www.propublica.org/datastore/dataset/congressional-data-bulk-legislation-bills
Perfect! Thank you! You may want to update your documentation. I may be misreading it, but the documentation for bills implies this data is missing / there is no scraper. Thank you once again!
@Darokrithia the documentation is correct in that this project only handles legislation from 2013-onward.
Any movement on this?
We need to transition the nominations scraper to congress.gov because thomas.gov is shutting down.