unitedstates / wish-list

A wish list for this organization; open an Issue to discuss what we can add. Derived from a News Foo session.
https://github.com/unitedstates/wish-list/issues

Scraping White House Petitions #7

Closed: wilson428 closed this issue 11 years ago

wilson428 commented 11 years ago

I wrote a few simple Python scripts to crawl the White House petitions site and search Twitter for links to petitions not yet on the public site. Might be worth including here:

https://github.com/wilson428/get.gov/blob/master/petitions/scripts/get_petitions.py

Update: project created at https://github.com/unitedstates/petitions
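
For the curious, the crawl in get_petitions.py boils down to something like the sketch below. This is a minimal reconstruction, not the actual script: the `?page=` parameter and the `/petition/` link pattern are assumptions, and the real selectors may differ.

```python
# Minimal sketch of the petition crawl; the ?page= parameter and the
# /petition/ link pattern are assumptions -- see get_petitions.py for
# the real logic.
import requests
from bs4 import BeautifulSoup

BASE = "https://petitions.whitehouse.gov"

def petition_links(page=0):
    resp = requests.get(BASE + "/petitions", params={"page": page})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Collect links to individual petition pages on this index page.
    return sorted({a["href"] for a in soup.select("a[href^='/petition/']")})

if __name__ == "__main__":
    for link in petition_links():
        print(BASE + link)
```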

konklone commented 11 years ago

That's pretty interesting. Could it be adapted to pull down all petitions, too, and/or all responses?

wilson428 commented 11 years ago

Sure, it looks like responses follow the same format: https://petitions.whitehouse.gov/responses

It presently pulls down every open petition on the site. The Twitter search is just ancillary -- is that what you mean?

konklone commented 11 years ago

Sorry, I looked at the code too quickly; I see what you mean.

If you were interested in maintaining a petition scraper repo here, I think that'd fit the banner of unitedstates pretty nicely. My main requests would be to move the OAuth creds to their own config file out of source control, and to have it output data files (e.g. JSON) instead of writing to a SQLite database. But yeah, we're pretty open here, and certainly not limited to the legislative branch.
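
Concretely, those two changes might look something like this (a sketch with hypothetical file and key names, not the actual code):

```python
# Sketch: creds come from a config file that's listed in .gitignore,
# and scraped petitions land as JSON data files instead of SQLite rows.
# config.json and the "id" key are hypothetical names.
import json
import os

def load_config(path="config.json"):
    # config.json holds the Twitter OAuth creds and stays out of git.
    with open(path) as f:
        return json.load(f)

def write_petition(petition, data_dir="data"):
    # One JSON file per petition, keyed by its id.
    os.makedirs(data_dir, exist_ok=True)
    out = os.path.join(data_dir, "%s.json" % petition["id"])
    with open(out, "w") as f:
        json.dump(petition, f, indent=2)
```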

Also! I see the get.gov project has a nominations scraper for THOMAS. Have you seen our THOMAS scraper at https://github.com/unitedstates/congress? It does pretty much everything but nominations now. :) Would you be interested in integrating your code?

wilson428 commented 11 years ago

Oh geez, I can't believe I left the OAuth creds in there -- time to regenerate! Would be happy to maintain it here, and to integrate the THOMAS nominations scraping.

dwillis commented 11 years ago

I can help with the nominations, too. I've got a scraper for it (in Ruby) that we use.

konklone commented 11 years ago

With your powers combined...

@wilson428, I'm making a new repo here called "petitions" -- want to recreate it there? I'll give you access to it, and you can do whatever you want with it there.

wilson428 commented 11 years ago

Sure thing, I'll move from SQLite to JSON. What's the best way to handle updating the petition counts? Just rewrite the files?

konklone commented 11 years ago

Probably; that's what we do with our THOMAS scraper. We just added a --fast flag that downloads only what's new or recently changed, and even then I'd still want to run a full re-download once a night, just to be on the safe side.
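
Roughly, the pattern is the one sketched below. The helper functions are hypothetical stand-ins, not our actual THOMAS code; the point is that every write is a plain overwrite, so counts are always just the latest observed values.

```python
# Sketch of the update pattern: every run overwrites data/<id>.json,
# so --fast just narrows the loop and the nightly full run is the
# safety net. changed_petition_ids, all_petition_ids, fetch_petition,
# and write_petition are hypothetical stand-ins for the real scraper.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--fast", action="store_true",
                    help="only re-fetch petitions that look new or changed")
args = parser.parse_args()

ids = changed_petition_ids() if args.fast else all_petition_ids()
for pid in ids:
    write_petition(fetch_petition(pid))  # plain overwrite, no merge step
```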

konklone commented 11 years ago

And feel free to move it in before doing that stuff -- GitHub's for sausage, not steak. :)

wilson428 commented 11 years ago

Will do as soon as I replace SQLite with JSON -- won't be long. Thanks!

wilson428 commented 11 years ago

Got the initial commit in, and did my best to reuse the utils.py from the other projects for uniformity.

konklone commented 11 years ago

This is awesome. Thank you for doing this! Using our utils.py is actually super helpful, plus you get some baked-in rate limiting via scrapelib.

I just added a requirements.txt with the known dependencies, and updated the README with a setup command that uses it. I also made petitions.py executable, and added a .gitignore for .pyc files and the data/cache folders.
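
For reference, the scrapelib win looks roughly like this (the wrapper in utils.py differs in detail, and the parameter values are just examples):

```python
# Roughly what utils.py buys you via scrapelib: a throttled, retrying
# HTTP client instead of raw urllib calls.
import scrapelib

scraper = scrapelib.Scraper(requests_per_minute=60, retry_attempts=3)
response = scraper.get("https://petitions.whitehouse.gov/petitions")
print(response.status_code)
```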

This is a very neat dataset, I'm going to spread this around. And close this ticket!

wilson428 commented 11 years ago

Thank you!