Hi! This was a pretty ad-hoc tool, and I don't know whether it still works as expected! But I will look into it when I get a chance. I have done some other regulations.gov scraping with other scripts, and I may try uploading those here, as I'd expect them to be more robust.
Here's a script that I have used more recently that might have a greater chance of working. I don't recall exactly what it provides but perhaps you will find it useful!
import requests
import csv
import time
import sys

api_key = ''         # insert your api key between quotes
docket_id = ''       # insert the docket id between quotes (e.g. VA-2016-VHA-0011)
total_docs = 217568  # total number of documents, as indicated by the page for the given docket id
docs_per_page = 1000 # maximum number of results per page; no reason to change

url = ('https://api.data.gov:443/regulations/v3/'
       'documents.json?api_key={}&dktid={}&rpp={}&po={}')


def make_urls():
    # One URL per page of results, stepping the page offset (po) by
    # docs_per_page until total_docs is covered.
    return [url.format(api_key, docket_id, docs_per_page, i)
            for i in range(0, total_docs, docs_per_page)]


def get(url):
    # Fetch one page; return its list of documents, or an empty value on
    # any non-200 response so the caller can record the URL as an error.
    r = requests.get(url)
    if r.status_code == 200:
        return r.json().get('documents', {})
    else:
        return {}


def save_batch(batch, ix):
    # Write one page of documents to its own CSV. The columns are the union
    # of all keys seen in the batch, since individual records can differ.
    keys = set(k for d in batch for k in d.keys())
    with open('batch_{:03d}.csv'.format(ix), 'w', encoding='utf-8') as op:
        wr = csv.DictWriter(op, fieldnames=sorted(keys))
        wr.writeheader()
        wr.writerows(batch)


def fetch_urls(urls):
    # Fetch the given URLs, saving each successful page and collecting
    # failures in error-urls.txt so they can be retried later.
    data = {}
    errors = []
    for i, url in enumerate(urls):
        d = get(url)
        if d:
            data[url] = d
            save_batch(d, i)
        else:
            errors.append(url)
            print('error on url {}'.format(url))
        time.sleep(5)
    with open('error-urls.txt', 'w', encoding='utf-8') as op:
        for e in errors:
            op.write(e)
            op.write('\n')


def fix_errors(err_file):
    # Retry only the URLs recorded in a previous run's error file.
    with open(err_file, encoding='utf-8') as ip:
        urls = [u.strip() for u in ip if u.strip()]
    fetch_urls(urls)


def main():
    urls = make_urls()
    fetch_urls(urls)


if __name__ == '__main__':
    # Run with no arguments for a full fetch, or pass an error-urls.txt
    # from a previous run to retry just the failed pages.
    if len(sys.argv) > 1:
        errf = sys.argv[1]
        fix_errors(errf)
    else:
        main()
(When I get a chance I'll replace the current one with this and write a more detailed readme.)
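If a run completes, each page of results ends up in its own batch_NNN.csv. In case it helps, here is a minimal sketch for stitching those batch files into a single CSV afterwards; it isn't part of the script above, and the combined.csv name is just illustrative. The union-of-keys step matters because different documents can carry different fields.

import csv
import glob

# Gather every per-page batch file written by the scraper.
paths = sorted(glob.glob('batch_*.csv'))

rows = []
fieldnames = set()
for path in paths:
    with open(path, encoding='utf-8') as ip:
        for row in csv.DictReader(ip):
            rows.append(row)
            fieldnames.update(row.keys())

# Write everything to one file; missing fields are left blank.
with open('combined.csv', 'w', encoding='utf-8', newline='') as op:
    writer = csv.DictWriter(op, fieldnames=sorted(fieldnames))
    writer.writeheader()
    writer.writerows(rows)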
Hi there.
New to GitHub. I need to figure out how to write code to scrape comments off of regulations.gov with the new API, and then off of multiple web pages (though I think that is implied). It's about 11,000 comments. I'm an experienced researcher, but new to coding. I need help.
This, specifically, is the link to the comments I'm working with.
I just sent an email to regulations@erulemakinghelpdesk.com asking for an API key.
D
You should be able to do what you need by replacing the top nine lines of the script quoted above with the below. (You'll still need to insert your own API key, but everything else is as it should be.)
import requests
import csv
import time
import sys
api_key = 'YOUR_API_KEY_HERE' # insert your api key between quotes
docket_id = 'ED-2018-OCR-0064' # insert the docket id between quotes (e.g. VA-2016-VHA-0011)
total_docs = 14835 # total number of documents, as indicated by the page for the given docket id
docs_per_page = 1000 # maximum number of results per page; no reason to change
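As a rough sanity check (illustrative arithmetic only, not part of the script): with total_docs = 14835 and docs_per_page = 1000, make_urls() builds 15 page offsets, so a complete run should leave batch_000.csv through batch_014.csv on disk.

total_docs = 14835
docs_per_page = 1000

# Page offsets the script will request: 0, 1000, 2000, ..., 14000
offsets = list(range(0, total_docs, docs_per_page))
print(len(offsets))  # 15 -> expect batch_000.csv through batch_014.csv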
In case you need to refer back to this, here are screenshots showing where I got the docket ID and document count:
@dtgossi see above! I forgot to @ you.
@dtgossi, hoping the above worked. I'm going to close this since I've now updated the main script and provided a more detailed readme.
Hey Jonathan,
Sorry. I subscribed to the Python and R mailing lists trying to get help, and, in the deluge of emails I got, yours seems to have gotten lost in the mix. I apologize for that.
I'm going to go check out the GitHub now and try to make sense of this.
I did just--just--get the API key authorized.
Drake
Wow, Jonathan. You really went above and beyond on this one. Thank you so much.
By 9 lines, you must mean this:
[image: 9 lines.jpg]
What about the rest of the lines, though? What do those do, since there are 66 lines of code in this?
And then what do I have to do to tabulate the name, category, date, and comment, each into its own variable in the CSV?
[image: the 4 things.jpg]
Drake
Hello. Is this script still working? I keep getting KeyErrors at line 19 (submitterName) and below (e.g., organization), even though I know the values are present in the document. I can't seem to debug it. Do you have an example of a working documentID to run this script on? Thanks!