upenndigitalscholarship / regulations-gov-comment-scraper


Script still working? Script fails at line 19 #1

Closed Camiann closed 5 years ago

Camiann commented 6 years ago

Hello. Is this script still working? I keep getting KeyErrors at line 19 (submitterName) and below (e.g., organization), even though I know the values are present in the document. I can't seem to debug it. Do you have an example of a working documentID to run this script on? Thanks!

senderle commented 6 years ago

Hi! This was a pretty ad-hoc tool, and I don't know whether it still works as expected under any circumstances! But I will look into it when I get a chance. I have done some other regulations.gov scraping using other scripts, and I might try uploading them here, as I'd expect them to be more robust.
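
In the meantime, if the KeyErrors are the only blocker, a quick workaround is to read fields with dict.get so that a missing key yields a blank instead of crashing. A sketch (I haven't re-tested the script, and doc here stands for a single document record from the API response):

# instead of e.g. doc['submitterName'], which raises a KeyError
# when the field is absent from a given document:
name = doc.get('submitterName', '')
org = doc.get('organization', '')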

senderle commented 6 years ago

Here's a script I've used more recently that has a better chance of working. I don't recall exactly what it provides, but perhaps you'll find it useful!

import requests
import csv
import time
import sys

api_key = '' # insert your api key between quotes
docket_id = '' # insert the docket id between quotes (e.g. VA-2016-VHA-0011)
total_docs = 217568  # total number of documents, as indicated by the page for the given docket id
docs_per_page = 1000  # maximum number of results per page; no reason to change

url = ('https://api.data.gov:443/regulations/v3/'
       'documents.json?api_key={}&dktid={}&rpp={}&po={}')

def make_urls():
    # one URL per page of results, stepping the po (page offset) parameter
    return [url.format(api_key, docket_id, docs_per_page, i)
            for i in range(0, total_docs, docs_per_page)]

def get(url):
    # fetch one page of results; return its 'documents' list, or [] on failure
    r = requests.get(url)
    if r.status_code == 200:
        return r.json().get('documents', [])
    else:
        return []

def save_batch(batch, ix):
    # write one page of documents to its own CSV; the header is the union
    # of all keys seen, since documents don't all have the same fields
    keys = set(k for d in batch for k in d.keys())
    with open('batch_{:03d}.csv'.format(ix), 'w', encoding='utf-8', newline='') as op:
        wr = csv.DictWriter(op, fieldnames=sorted(keys))
        wr.writeheader()
        wr.writerows(batch)

def fetch_urls(urls):
    # fetch each page, saving successful batches and recording failures
    # (note: this takes the URLs it was passed, rather than regenerating
    # them, so that fix_errors can retry just the failed ones)
    errors = []
    for i, url in enumerate(urls):
        d = get(url)
        if d:
            save_batch(d, i)
        else:
            errors.append(url)
            print('error on url {}'.format(url))
        time.sleep(5)  # pause between requests to stay within the API's rate limit

    with open('error-urls.txt', 'w', encoding='utf-8') as op:
        for e in errors:
            op.write(e)
            op.write('\n')

def fix_errors(err_file):
    # retry just the URLs recorded in a previous run's error file
    with open(err_file, encoding='utf-8') as ip:
        urls = [line.strip() for line in ip if line.strip()]
    fetch_urls(urls)

def main():
    urls = make_urls()
    fetch_urls(urls)

if __name__ == '__main__':
    if len(sys.argv) > 1:
        errf = sys.argv[1]
        fix_errors(errf)
    else:
        main()
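
Since each page lands in its own batch_NNN.csv, you may want to stitch them back together at the end. Here's one way to do that (a sketch; it assumes the batches were written by save_batch above, and tolerates batches with differing columns):

import csv
import glob

def merge_batches(out_path='all_documents.csv'):
    # gather every row from every batch file, tracking the union of columns
    rows = []
    keys = set()
    for path in sorted(glob.glob('batch_*.csv')):
        with open(path, encoding='utf-8') as ip:
            for row in csv.DictReader(ip):
                keys.update(row.keys())
                rows.append(row)
    # write everything to a single CSV; missing fields are left blank
    with open(out_path, 'w', encoding='utf-8', newline='') as op:
        wr = csv.DictWriter(op, fieldnames=sorted(keys))
        wr.writeheader()
        wr.writerows(rows)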
senderle commented 6 years ago

(When I get a chance I'll replace the current one with this and write a more detailed readme.)

dtgossi commented 5 years ago

Hi there.

New to GitHub. I need to figure out how to write code to scrape comments off of regulations.gov with the new API, across multiple web pages (though I think that's implied). It's about 11,000 comments. Experienced researcher, but new to coding. Need help.

This, specifically, is the link to the comments I'm working with.

I just sent an email to regulations@erulemakinghelpdesk.com asking for an API key.

D

senderle commented 5 years ago

You should be able to do what you need by replacing the top nine lines of the script quoted above with the lines below. (You'll still need to insert your own API key, but everything else is as it should be.)

import requests
import csv
import time
import sys

api_key = 'YOUR_API_KEY_HERE' # insert your api key between quotes
docket_id = 'ED-2018-OCR-0064' # insert the docket id between quotes (e.g. VA-2016-VHA-0011)
total_docs = 14835  # total number of documents, as indicated by the page for the given docket id
docs_per_page = 1000  # maximum number of results per page; no reason to change
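
For context, the rest of the script just pages through the results: make_urls steps the po (page offset) parameter in increments of docs_per_page, so with these settings it will make 15 requests and write each page out as its own CSV.

# the page offsets make_urls will request with these settings:
offsets = list(range(0, total_docs, docs_per_page))
# -> [0, 1000, 2000, ..., 14000]: 15 pages covering all 14835 documents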
senderle commented 5 years ago

In case you need to refer back to this, here are screenshots showing where I got the docket ID and document count:

[Screenshots: the regulations.gov docket page, showing where the docket ID and total document count appear]
senderle commented 5 years ago

@dtgossi see above! I forgot to @ you.

senderle commented 5 years ago

@dtgossi, hoping the above worked. I'm going to close this since I've now updated the main script and provided a more detailed readme.

dtgossi commented 5 years ago

Hey Jonathan,

Sorry. I subscribed to the Python and R mailing lists trying to get help, and, in the deluge of emails I got, yours seems to have gotten lost in the mix. I apologize for that.

I'm going to go check out the GitHub repo now and try to make sense of this.

I did just--just--get the API key authorized.

Drake


dtgossi commented 5 years ago

Wow, Jonathan. You really went above and beyond on this one. Thank you so much.

By 9 lines, you must mean this:

[image: 9 lines.jpg]

What about the rest of the lines, though? What do those do? There are 66 lines of code in this...

And then what do I have to do to tabulate the name, category, date, and comment, each into its own variable in the CSV?

[image: the 4 things.jpg]

Drake
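
For the four-field question: one approach would be to give save_batch a fixed list of fieldnames instead of collecting every key it sees. A sketch (the key names submitterName, category, postedDate, and commentText are guesses and should be checked against an actual API response):

def save_batch(batch, ix):
    # keep only the four fields of interest; extrasaction='ignore'
    # silently drops every other key the API returns
    keys = ['submitterName', 'category', 'postedDate', 'commentText']
    with open('batch_{:03d}.csv'.format(ix), 'w', encoding='utf-8', newline='') as op:
        wr = csv.DictWriter(op, fieldnames=keys, extrasaction='ignore')
        wr.writeheader()
        wr.writerows(batch)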
