upenndigitalscholarship / regulations-gov-comment-scraper


Tool needs update for v4 #4

Open rebekahjacob opened 3 years ago

rebekahjacob commented 3 years ago

The site has had a major update to https://api.regulations.gov/v4/ and the current tool will no longer run as is. Inspecting the response shows this error: { "error": { "code": "MOVED_PERMANENTLY", "message": "Regulations.gov API v3 is no longer available for use. Please contact us at https://www.regulations.gov/support for assistance." } }

v3 is no longer accessible. Attached is my email exchange with the regulations.gov helpdesk from 5/21/21: helpdesk email.pdf

The major difference in v4 is that the endpoint is split into three categories (Document, Comment, and Docket). It is also possibly more stringent on requests per minute. See the documentation here: https://open.gsa.gov/api/regulationsgov/#frequently-asked-questions

rebekahjacob commented 3 years ago

Here is some similar work completed by Will Jobs in v4: https://github.com/willjobs/regulations-public-comments

With Will's permission, I'm including a bit of our email exchange from 5/25/21. Will wrote: "As I’m sure you know, the annoying thing with the way comments are set up is that they’re associated with a document, and a docket can contain many documents. You may have also gathered from my blog posts that, the way the new API is set up, you can’t query for all comments on a given docket, and if you want all comments on a given document, you have to know the document’s “objectId”, which is different from the public-facing documentId. So the order of operations is: use docket to look up associated documents, then use each document’s objectId to get its comments. There’s an extra step after that, too, because when you query for the comments on a document, you get some metadata (I call it “header” information). To get the actual text of the comment (and more detailed info), you have to then access each comment individually, one at a time.

In addition, there’s an annoying pagination “feature” that the API uses, which gives you up to 250 items per page (request), and up to 20 pages per query. If your query returns more than 250x20 = 5000 items, you have to manually deal with it by first sorting your queries by lastModifiedDate, then after 20 pages, filtering the next query by lastModifiedDate >= max(lastModifiedDate) from the previous query."
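For anyone updating the tool, here is a rough sketch of the order of operations Will describes (docket, then documents, then each document's objectId, then comment headers, then one request per comment). It is not taken from this repo or from Will's code. The paths `/documents`, `/comments`, and `/comments/{id}`, the query parameters `filter[docketId]` and `filter[commentOnId]`, the `X-Api-Key` header, and the `data` / `attributes` response layout are my assumptions from reading the v4 documentation and should be checked against it. The function names are made up for illustration, and pagination is ignored here.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, List

BASE_URL = "https://api.regulations.gov/v4"

# A fetcher takes an API path plus query parameters and returns parsed JSON.
GetJson = Callable[[str, Dict[str, str]], dict]


def make_get_json(api_key: str) -> GetJson:
    """Build a fetcher for the live API (assumes the X-Api-Key header is accepted)."""
    def get_json(path: str, params: Dict[str, str]) -> dict:
        url = BASE_URL + path
        if params:
            url += "?" + urllib.parse.urlencode(params)
        request = urllib.request.Request(url, headers={"X-Api-Key": api_key})
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return get_json


def comment_ids_for_docket(docket_id: str, get_json: GetJson) -> Iterator[str]:
    """Yield the id of every comment header found under a docket's documents."""
    # Step 1: docket -> documents (assumed filter name).
    documents = get_json("/documents", {"filter[docketId]": docket_id})["data"]
    for document in documents:
        # Step 2: comments are looked up by the document's objectId,
        # not by its public-facing documentId.
        object_id = document["attributes"]["objectId"]
        headers = get_json("/comments", {"filter[commentOnId]": object_id})["data"]
        for header in headers:
            yield header["id"]


def comment_details_for_docket(docket_id: str, get_json: GetJson) -> List[dict]:
    """Step 3: fetch each comment individually to get its full text."""
    return [
        get_json(f"/comments/{comment_id}", {})["data"]
        for comment_id in comment_ids_for_docket(docket_id, get_json)
    ]
```

Passing the fetcher in as an argument keeps the lookup logic testable without network access. Against the live API it would be called as `comment_details_for_docket(docket_id, make_get_json(api_key))`, with some delay between requests added to stay under the rate limit.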

See Will's readme text for more information.
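The pagination workaround Will describes (250 items per page, 20 pages per query, then restart from the largest lastModifiedDate seen) could look roughly like the sketch below. Again, this is my own illustration, not Will's implementation. The parameter names `sort`, `page[size]`, `page[number]`, and `filter[lastModifiedDate][ge]` are assumptions to verify against the v4 documentation, and `fetch_all` is a hypothetical name. The sketch passes the timestamp back exactly as it was returned; the real filter may expect a different date format, and that conversion is left out. Because the `>=` filter re-fetches items that share the cutoff timestamp, results are de-duplicated by id.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 250   # maximum items per page, per the quoted email
MAX_PAGES = 20    # maximum pages per query, per the quoted email

GetJson = Callable[[str, Dict[str, str]], dict]


def fetch_all(path: str, base_params: Dict[str, str], get_json: GetJson,
              page_size: int = PAGE_SIZE, max_pages: int = MAX_PAGES) -> List[dict]:
    """Collect every item for a query, restarting after each window of max_pages."""
    seen: Dict[str, dict] = {}   # id -> item, keeps first-seen order
    cutoff = None                # largest lastModifiedDate seen so far
    while True:
        new_in_window = 0
        exhausted = False
        for page_number in range(1, max_pages + 1):
            params = dict(base_params)
            params["sort"] = "lastModifiedDate"
            params["page[size]"] = str(page_size)
            params["page[number]"] = str(page_number)
            if cutoff is not None:
                # Assumed filter name; the date format may need converting.
                params["filter[lastModifiedDate][ge]"] = cutoff
            items = get_json(path, params)["data"]
            for item in items:
                if item["id"] not in seen:
                    seen[item["id"]] = item
                    new_in_window += 1
            if len(items) < page_size:
                exhausted = True   # a short page means nothing is left
                break
        # Stop when done, or when a full window produced nothing new
        # (for example, more than one window of items sharing one timestamp).
        if exhausted or new_in_window == 0:
            break
        cutoff = max(item["attributes"]["lastModifiedDate"] for item in seen.values())
    return list(seen.values())
```

One limitation to be aware of: if more than 5000 items share the same lastModifiedDate, restarting from that timestamp returns the same window again, so the loop stops early rather than spinning forever and some items are missed.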