Closed: wilson428 closed this issue 11 years ago
That's pretty interesting. Could it be adapted to get down all petitions, too, and/or all responses?
Sure, it looks like responses follow the same format: https://petitions.whitehouse.gov/responses
It presently pulls down every open petition on the site. The Twitter search is just ancillary -- is that what you mean?
Sorry, I looked at the code too quickly, I see what you mean.
If you were interested in maintaining a petition scraper repo here, I think that'd fit the banner of unitedstates pretty nicely. My main requests would be to move the OAuth creds to their own config file out of source control, and to have it output data files (e.g. JSON) instead of writing to a SQLite database. But yeah we're pretty open here, and certainly not limited to the legislative branch.
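The two requests above (creds out of source control, JSON output instead of SQLite) could be sketched roughly like this. This is an illustrative sketch, not the actual scraper's code: `config.json`, `load_credentials`, and `write_petitions` are all hypothetical names, and the petition fields are assumed.

```python
import json
import os

# Hypothetical: OAuth creds live in a local config file that is
# listed in .gitignore, so they never end up in source control.
CONFIG_PATH = "config.json"

def load_credentials(path=CONFIG_PATH):
    """Read OAuth credentials from an untracked local config file."""
    with open(path) as f:
        return json.load(f)

def write_petitions(petitions, out_dir="data"):
    """Dump each petition to data/<id>.json instead of a SQLite row.

    Re-running the scraper simply overwrites the files, so the output
    directory always reflects the latest scrape.
    """
    os.makedirs(out_dir, exist_ok=True)
    for petition in petitions:
        path = os.path.join(out_dir, "%s.json" % petition["id"])
        with open(path, "w") as f:
            json.dump(petition, f, indent=2)
```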
Also! I see the get.gov project has a nominations scraper for THOMAS. Have you seen our THOMAS scraper at https://github.com/unitedstates/congress? It does pretty much everything but nominations now. :) Would you be interested in integrating your code?
Oh geez, I can't believe I left the OAuth creds in there -- time to regenerate! Would be happy to maintain here, and integrate nominations in THOMAS
I can help with the nominations, too. I've got a scraper for it (in Ruby) that we use.
With your powers combined...
@wilson428, I'm making a new repo here called "petitions" - want to regenerate it there? I'll give you access to it, and you can do whatever you want with it there.
Sure thing, will move from SQLite to JSON. What's the best way to handle updating the petition count? Just rewrite the files?
Probably, that's what we do with our THOMAS scraper. We only just added a --fast flag that only downloads what's new or recently changed, and even then I'd still want to run a full re-download once a night just to be on the safe side.
And feel free to move it in before doing that stuff - Github's for sausage, not steak. :)
Will do soon as I replace SQLite with JSON -- won't be long. Thanks!
Got initial commit in, and did my best to reuse the utils.py present in other projects for uniformity.
This is awesome. Thank you for doing this! Using our utils.py is super helpful actually, plus you get some baked-in rate limiting via scrapelib.
I just added a requirements.txt with known dependencies, and updated the README with a setup command that uses it. I also made petitions.py executable, and added a gitignore for pyc files and the data/cache folders.
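The setup files described above might look roughly like the following. The exact contents are an assumption based on this comment (scrapelib per the note above, pyc/data/cache ignores as stated), not a copy of the actual repo:

```
# requirements.txt (assumed: scrapelib, used by utils.py)
scrapelib

# .gitignore (per the comment: pyc files and the data/cache folders)
*.pyc
data/
cache/
```

With those in place, setup is the usual `pip install -r requirements.txt`.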
This is a very neat dataset, I'm going to spread this around. And close this ticket!
Thank you!
I wrote a few simple Python scripts to crawl the White House petitions site and search Twitter for links to petitions not yet on the public site. Might be worth including here:
https://github.com/wilson428/get.gov/blob/master/petitions/scripts/get_petitions.py
Update: project created at https://github.com/unitedstates/petitions