nkandpa2 commented 7 months ago

The US GPO is the agency responsible for publishing documents authored by the US federal government, which are therefore in the public domain. It provides an API for accessing these documents and their associated metadata.

nkandpa2 commented 7 months ago

The usgpo branch has some initial code for collecting this data. The main "collections" containing text files are the following:

Congressional Bills
United States Budget
Congressional Directory
Code of Federal Regulations
Compilation of Presidential Documents
Congressional Record Index
Coastal Zone Information Center
Government Accountability Office Reports and Comptroller General Decisions
Bulk Submission
Additional Government Publications
Journal of the House of Representatives
History of Bills
Privacy Act Issuances
Public and Private Laws
United States Code

There are other collections but the data in these are mostly PDFs. If we have a good way of extracting text from these we can consider the other collections as well.

I've run the code against data from 2023-01-01 to the present and found 17K documents with 300M tokens. If we go with a larger date range, like all documents since 2000, this extrapolates to about 5B tokens. It could be more, depending on our appetite for going further back in time.

TODO

[ ] Write script for converting to Dolma
[ ] Write driver script to run the full job

storytracer commented 7 months ago

Did you try the USGPO Gov.info bulk data service I mentioned in this issue? Might be less work to download and process.

Gov.info has a bulk data service, which provides machine-readable versions of bills, statutes, codes, etc. as XML and JSON. Here's the documentation in a GH repo: https://github.com/usgpo/bulk-data.

nkandpa2 commented 7 months ago

I had briefly looked into this, but based on the file names and modification dates of the bulk data, it seems to be a subset of what USGPO actually publishes. It's probably a good idea to check whether the bulk data contains anything I missed while scraping, as it would be easy to incorporate.

craffel commented 7 months ago

Is this ready for a PR?

StellaAthena commented 7 months ago

@alon-albalak are you working on this? This was one of the two examples of high-priority sources I sent you last week.

alon-albalak commented 6 months ago

> @alon-albalak are you working on this? This was one of the two examples of high-priority sources I sent you last week.

Not yet. I'm currently at ICLR and will get in touch with @nkandpa2 next week to see what still needs to be done!

storytracer commented 3 months ago

> There are other collections but the data in these are mostly PDFs. If we have a good way of extracting text from these we can consider the other collections as well.

As I mentioned in our recent call, I discovered that the "other collections" you mention, @nkandpa2, do in fact have massive plain text files available, but they are hidden beneath a second API layer called granules.

I discovered this API layer while looking at the Congressional Hearings collection, because I was considering transcribing the hearing recordings and wanted to check how automated transcripts differ from the official ones. So let's take the hearings as an example to look at.

Like other govinfo collections, the hearings are divided by the API into packages. You query these packages in your get_packages method from the /published API endpoint. Each package provides a summary with download links which you retrieve from the /packages/{package_id}/summary endpoint in your get_file_links method.
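To make the two steps concrete, here is a minimal sketch of how the URLs for both endpoints could be built. The helper names are hypothetical and not taken from the usgpo branch, and the query parameters (`offsetMark`, `pageSize`, `collection`, `api_key`) are my assumption, modeled on the query string the API itself returns in its links, so check them against the API docs before relying on them.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.govinfo.gov"


def published_url(start_date, collection, api_key, page_size=100):
    """Build a /published URL listing packages issued on or after start_date.

    start_date is a YYYY-MM-DD string. The query parameter names are
    assumptions, not confirmed by this thread.
    """
    query = urlencode(
        {
            "offsetMark": "*",  # "*" requests the first page
            "pageSize": page_size,
            "collection": collection,
            "api_key": api_key,
        },
        safe="*",  # keep the asterisk unescaped, as in the API's own links
    )
    return f"{API_ROOT}/published/{start_date}?{query}"


def package_summary_url(package_id, api_key):
    """Build the /packages/{package_id}/summary URL for one package."""
    return f"{API_ROOT}/packages/{package_id}/summary?{urlencode({'api_key': api_key})}"
```

The first URL corresponds to what get_packages queries, the second to what get_file_links queries for each package ID.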

However, there is a difference between serial collections and non-serial collections in the download links returned by the package summary endpoint. Here is the JSON response for the package BILLS-118hconres1eh from the non-serial Congressional Bills collection (BILLS):

{
  "originChamber": "HOUSE",
  "congress": "118",
  "session": "1",
  "detailsLink": "https://www.govinfo.gov/app/details/BILLS-118hconres1eh",
  "isPrivate": "false",
  "title": "Regarding consent to assemble outside the seat of government.",
  "branch": "legislative",
  "isAppropriation": "false",
  "collectionName": "Congressional Bills",
  "download": {
    "premisLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/premis",
    "xmlLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/xml",
    "txtLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/htm",
    "zipLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/zip",
    "modsLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/mods",
    "pdfLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/pdf"
  },
  "pages": "4",
  "related": {
    "billStatusLink": "https://api.govinfo.gov/packages/BILLSTATUS-118hconres1/xml"
  },
  "relatedLink": "https://api.govinfo.gov/related/BILLS-118hconres1eh",
  "suDocClassNumber": "Y 1.6:, Y 1.4/9:",
  "dateIssued": "2023-01-09",
  "currentChamber": "HOUSE",
  "billVersion": "eh",
  "billType": "hconres",
  "packageId": "BILLS-118hconres1eh",
  "collectionCode": "BILLS",
  "governmentAuthor2": "House of Representatives",
  "governmentAuthor1": "Congress",
  "publisher": "U.S. Government Publishing Office",
  "docClass": "hconres",
  "lastModified": "2024-06-06T19:32:41Z",
  "category": "Bills and Statutes",
  "billNumber": "1",
  "otherIdentifier": {
    "migrated-doc-id": "f:hc1_eh.txt",
    "parent-ils-system-id": "000501532",
    "child-ils-title": "House concurrent resolutions",
    "parent-ils-title": "Congressional bills",
    "child-ils-system-id": "000325575",
    "stock-number": "021-610-00252-9"
  }
}

The JSON response above contains a direct txtLink download link for the plain text of the package. In contrast, the JSON response below for the package CHRG-118hhrg52370 from the serial Congressional Hearings collection (CHRG) does not contain a txtLink download link:

{
  "dateIssued": "2023-02-28",
  "documentType": "HHRG",
  "congress": "118",
  "heldDates": [
    "2023-02-28"
  ],
  "session": "1",
  "packageId": "CHRG-118hhrg52370",
  "collectionCode": "CHRG",
  "detailsLink": "https://www.govinfo.gov/app/details/CHRG-118hhrg52370",
  "title": "UNCERTAINTY, INFLATION, REGULATIONS: CHALLENGES FOR AMERICAN AGRICULTURE",
  "branch": "legislative",
  "collectionName": "Congressional Hearings",
  "download": {
    "premisLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/premis",
    "zipLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/zip",
    "modsLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/mods"
  },
  "pages": "272",
  "governmentAuthor2": "House of Representatives",
  "chamber": "HOUSE",
  "relatedLink": "https://api.govinfo.gov/related/CHRG-118hhrg52370",
  "governmentAuthor1": "Congress",
  "publisher": "U.S. Government Publishing Office",
  "suDocClassNumber": "Y 4.AG 8/1:118-1",
  "docClass": "HHRG",
  "lastModified": "2024-06-07T02:18:29Z",
  "category": "Congressional Committee Materials",
  "otherIdentifier": {
    "migrated-doc-id": "f:52370.txt",
    "ils-system-id": "001230528"
  },
  "granulesLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/granules?offsetMark=*&pageSize=100"
}

However, as you can see, the serial package JSON contains an additional key called granulesLink. Requesting this endpoint returns the list of granules associated with the package:

{
  "count": 1,
  "offset": null,
  "pageSize": 100,
  "nextPage": null,
  "previousPage": null,
  "granules": [
    {
      "title": "Uncertainty, Inflation, Regulations: Challenges for American Agriculture",
      "granuleId": "CHRG-118hhrg52370",
      "granuleLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/granules/CHRG-118hhrg52370/summary",
      "granuleClass": null
    }
  ]
}
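Since the granule list is paginated (note the pageSize and nextPage keys), collecting every granule of a large package means following nextPage until it is null. A rough sketch, assuming a hypothetical `fetch_json(url)` helper that performs the GET request (adding the API key as needed) and returns the decoded JSON:

```python
def iter_granules(granules_link, fetch_json):
    """Yield every granule entry of a package, one page at a time.

    fetch_json is a hypothetical caller-supplied function: it takes a URL
    and returns the parsed JSON response as a dict.
    """
    url = granules_link
    while url:
        page = fetch_json(url)
        # Each page carries a "granules" list like the one shown above.
        yield from page.get("granules", [])
        # "nextPage" is null (None) on the last page, which ends the loop.
        url = page.get("nextPage")
```

Passing the fetch function in keeps the loop independent of the HTTP client and makes it easy to test against canned responses.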

Packages from the hearings collection usually contain only one granule per package, with the granule ID being identical to the package ID, but other serial collections, like the Federal Register (FR), return several dozen granules or more per package with dedicated granule IDs. As you can see from the granule list JSON response, each granule has its own summary link, which returns a JSON response similar to the package summary JSON. This granule summary is where we can find the plain text download links for serial collections:

...
  "packageId": "CHRG-118hhrg52370",
  "packageLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/summary",
  "committees": [
    {
      "authorityId": "hsag00",
      "chamber": "H",
      "committeeName": "Committee on Agriculture",
      "type": "S"
    }
  ],
  "collectionCode": "CHRG",
  "detailsLink": "https://www.govinfo.gov/app/details/CHRG-118hhrg52370/CHRG-118hhrg52370",
  "title": "Uncertainty, Inflation, Regulations: Challenges for American Agriculture",
  "isAppropriation": "false",
  "collectionName": "Congressional Hearings",
  "granuleClass": "OTHERPART",
  "granuleId": "CHRG-118hhrg52370",
  "download": {
    "premisLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/premis",
    "txtLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/granules/CHRG-118hhrg52370/htm",
    "zipLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/zip",
    "modsLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/granules/CHRG-118hhrg52370/mods",
    "pdfLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/granules/CHRG-118hhrg52370/pdf"
  },
...

The single hearing transcript for the granule CHRG-118hhrg52370 contains more than 150K whitespace-separated tokens. Such long-form transcripts seem particularly valuable as a data source, especially because of their multi-speaker nature. The token count for the Federal Register varies more from document to document, but the collection contains over 92K granules, so I think it is worth adjusting the code for serial collections in general, so that we can get all the text data from the remaining collections. Let me know if you need a hand adjusting the code, @nkandpa2!
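For illustration, the adjustment could look roughly like the sketch below: use the package-level txtLink when it exists, otherwise walk the granules and collect each granule's txtLink. This is not the code on the usgpo branch. `text_links` and `fetch_json` are hypothetical names, and `fetch_json(url)` is assumed to do the HTTP request (including the API key) and return parsed JSON.

```python
def text_links(package_summary, fetch_json):
    """Return the plain-text download links for one package summary.

    Non-serial packages (e.g. BILLS) expose download.txtLink directly.
    Serial packages (e.g. CHRG, FR) expose granulesLink instead, and the
    txtLink lives in each granule's own summary.
    """
    direct = package_summary.get("download", {}).get("txtLink")
    if direct:
        return [direct]

    links = []
    url = package_summary.get("granulesLink")
    while url:
        page = fetch_json(url)
        for granule in page.get("granules", []):
            granule_summary = fetch_json(granule["granuleLink"])
            link = granule_summary.get("download", {}).get("txtLink")
            if link:  # some granules may have no text rendition
                links.append(link)
        url = page.get("nextPage")  # None on the last page
    return links
```

With this shape, the existing download step would not need to know whether a link came from a package or a granule.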


US Government Publishing Office #64
