r-three / common-pile

Repo to hold code and track issues for the collection of permissively licensed data
MIT License

US Government Publishing Office #64

Open nkandpa2 opened 3 months ago

nkandpa2 commented 3 months ago

The US GPO is the agency responsible for publishing documents authored by the US federal government, which are therefore in the public domain. It provides an API for accessing these documents and their associated metadata.

nkandpa2 commented 3 months ago

The usgpo branch has some initial code for collecting this data. The main "collections" containing text files are the following:

There are other collections, but the data in them is mostly PDFs. If we have a good way of extracting text from PDFs, we can consider those collections as well.

I've run the code against data from 2023-01-01 to the present and found 17K documents with 300M tokens. If we go with a larger date range, like all documents since 2000, a linear extrapolation gives about 5B tokens. It could be more, depending on our appetite for going further back in time.
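For reference, the extrapolation is just a linear scaling by date range. Here is a rough sketch of the arithmetic, assuming the publishing rate is constant over time. The function name `extrapolate_tokens` and the end date `2024-04-01` are placeholders for illustration only (the exact date of the scrape isn't recorded here), so treat the result as a ballpark that lands in the same range as the ~5B figure.

```python
from datetime import date


def extrapolate_tokens(sample_tokens, sample_start, sample_end, target_start, target_end):
    """Scale a token count linearly from a sampled date range to a target date range."""
    sample_days = (sample_end - sample_start).days
    target_days = (target_end - target_start).days
    if sample_days <= 0:
        raise ValueError("sample range must span at least one day")
    return sample_tokens * target_days / sample_days


# ASSUMPTION: the scrape end date is not stated in the thread; this is a placeholder.
ASSUMED_SCRAPE_END = date(2024, 4, 1)

estimate = extrapolate_tokens(
    sample_tokens=300_000_000,        # 300M tokens found in the sampled range
    sample_start=date(2023, 1, 1),    # start of the sampled range
    sample_end=ASSUMED_SCRAPE_END,
    target_start=date(2000, 1, 1),    # "all documents since 2000"
    target_end=ASSUMED_SCRAPE_END,
)
print(f"Estimated tokens since 2000: {estimate / 1e9:.1f}B")
```

This ignores any change in publishing volume or digitization coverage in earlier years, so the real number could differ in either direction.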

TODO

storytracer commented 3 months ago

Did you try the USGPO Gov.info bulk data service I mentioned in this issue? Might be less work to download and process.

Gov.info has a bulk data service, which provides machine-readable versions of bills, statutes, codes, etc. as XML and JSON. Here's the documentation in a GH repo: https://github.com/usgpo/bulk-data.

nkandpa2 commented 3 months ago

I had briefly looked into this, but based on the file names and modification dates of the bulk data, it seems to be a subset of what's actually published by USGPO. It's probably a good idea to check whether there's anything in the bulk data that my scrape missed, as this would be easy to incorporate.

craffel commented 2 months ago

Is this ready for a PR?

StellaAthena commented 2 months ago

@alon-albalak are you working on this? This was one of the two examples of high-priority sources I sent you last week.

alon-albalak commented 2 months ago

Not yet. I'm currently at ICLR and will get in touch with @nkandpa2 next week to see what still needs to be done!