unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.
https://github.com/unitedstates/congress/wiki
Creative Commons Zero v1.0 Universal

Nominations Scraper -> congress.gov #173

Open crdunwel opened 8 years ago

crdunwel commented 8 years ago

We need to transition the nominations scraper to congress.gov because thomas.gov is shutting down.

JoshData commented 8 years ago

@dwillis points out that there's XML at http://www.senate.gov/general/XML.htm, so we should transition the scraper to use that instead.

veselins commented 8 years ago

[see #177 --- comment deleted by @JoshData]

kevinschaul commented 7 years ago

Hi all, I may take a stab at this. A few thoughts before I dive in:

The XML referenced above seems to show only nominations for the current term. Is there a preferred strategy for handling nominations that predate whatever the current term is?

Unfortunately I don't see a great source data format for nominations. Even those XML files would require parsing names of nominees (or getting that information from another source based on the nomination number PNXXX).

I'm considering having the nominations scraper use Congress.gov's search. This example query could work for finding the nominations with the most recent changes, and the details could be drawn from the nomination action pages. Does that strategy make sense?

Any other comments before I begin?

dwillis commented 7 years ago

@kevinschaul I think it's worth having the XML parser mainly because in theory it would be less brittle than scraping Congress.gov, but I'm not going to die on that hill if others don't feel strongly about it. In either case I think we'd need to do a little bit of parsing to grab the nominee name.

JoshData commented 7 years ago

It would be nice to have the full action history (only on Congress.gov?), but either data source is fine with me.

dwillis commented 7 years ago

Yeah, I think the full action history is better on Congress.gov. So let's go with a conventional scraper of that.

AlJohri commented 7 years ago

hi @dwillis @JoshData, I'd be interested in tackling a congress.gov scraper if it's still needed. is the task to grab the full action history for each nomination?

dwillis commented 7 years ago

Hey @AlJohri, thanks! That would be great. We'd want each nomination and all of its actions. There are a couple of wrinkles here: some nominations (especially military ones) represent multiple nominees, while most civilian nominations are for a single person. Others are divided into multiple records for the same PN number.
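
For illustration only, here is a minimal Python sketch of one way a scraper's output could model those two wrinkles: a record that holds one or many nominees, and several records sharing one PN number. Nothing here comes from the existing scraper or the gist. The `NominationRecord` class, the `parse_pn` and `group_by_pn` helpers, and the assumed identifier shape (`PN123`, or `PN123-2` for a numbered part) are all hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# ASSUMPTION: identifiers look like "PN123" or "PN123-2", where the
# optional "-2" suffix marks one part of a divided nomination.
PN_PATTERN = re.compile(r"^PN(\d+)(?:-(\d+))?$")


@dataclass
class NominationRecord:
    """Hypothetical container for one scraped nomination record."""
    pn_number: int                 # shared by all parts of a divided nomination
    part: Optional[int]            # None when the nomination is not divided
    nominees: List[str]            # one name for most civilian nominations, many for lists
    actions: List[str] = field(default_factory=list)  # full action history, oldest first


def parse_pn(identifier: str) -> Tuple[int, Optional[int]]:
    """Split an identifier such as 'PN123-2' into (123, 2)."""
    match = PN_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"unrecognized nomination identifier: {identifier!r}")
    number, part = match.groups()
    return int(number), int(part) if part is not None else None


def group_by_pn(records: List[NominationRecord]) -> Dict[int, List[NominationRecord]]:
    """Collect every record that shares a PN number under one key."""
    grouped: Dict[int, List[NominationRecord]] = defaultdict(list)
    for record in records:
        grouped[record.pn_number].append(record)
    return dict(grouped)


# Placeholder data, not real nominations.
records = [
    NominationRecord(*parse_pn("PN10-1"), nominees=["Nominee A", "Nominee B"]),
    NominationRecord(*parse_pn("PN10-2"), nominees=["Nominee C"]),
    NominationRecord(*parse_pn("PN11"), nominees=["Nominee D"]),
]
grouped = group_by_pn(records)
```

Keeping `nominees` as a list even for single-person nominations means downstream code never has to branch on the nomination type, and grouping by `pn_number` keeps the parts of a divided nomination together.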

AlJohri commented 7 years ago

sounds good! I made some initial progress for parsing a single nomination. This is what the output looks like: https://gist.github.com/AlJohri/80b5c4a55ddfc04eb2a24a413a0b79cd#file-sample-json

I'll look into a couple of those edge cases you mentioned next

dwillis commented 7 years ago

This looks good so far, thanks!

lwaltman commented 6 years ago

Has any progress been made on this scraper? Where is the current version? I'm just looking for a somewhat usable one to test with.

AlJohri commented 6 years ago

no sorry @lwaltman, but it should be pretty easy to modify https://gist.github.com/AlJohri/80b5c4a55ddfc04eb2a24a413a0b79cd#file-nominations-py

Darokrithia commented 4 years ago

Any updates in 2020? I may be willing to work on this, especially if there already is some infrastructure in place

Darokrithia commented 4 years ago

Also, I am more interested in bill data from 1970-2012, which was a part of the thomas scraper

dwillis commented 4 years ago

@Darokrithia We've got from 1973-on here: https://www.propublica.org/datastore/dataset/congressional-data-bulk-legislation-bills

Darokrithia commented 4 years ago

Perfect! Thank you! You may want to update your documentation. I may be misreading it, but the documentation for bills implies this data is missing / there is no scraper. Thank you once again!

dwillis commented 4 years ago

@Darokrithia the documentation is correct in that this project only handles legislation from 2013-onward.

ryparker commented 3 years ago

Any movement on this?