Open nkandpa2 opened 7 months ago
The usgpo branch has some initial code for collecting this data. The main "collections" containing text files are the following:
There are other collections but the data in these are mostly PDFs. If we have a good way of extracting text from these we can consider the other collections as well.
I've run the code against data from 2023-01-01 to the present and found 17K documents with 300M tokens. If we go with a larger date range, like all documents since 2000, this extrapolates to about 5B tokens. It could be more, depending on our appetite for going further back in time.
Did you try the USGPO Gov.info bulk data service I mentioned in this issue? Might be less work to download and process.
Gov.info has a bulk data service, which provides machine-readable versions of bills, statutes, codes, etc. as XML and JSON. Here's the documentation in a GH repo: https://github.com/usgpo/bulk-data.
I had briefly looked into this, but based on the file names and modification dates of the bulk data, it seems to be a subset of what USGPO actually publishes. It is probably a good idea to check whether the bulk data contains anything I missed scraping, as this would be easy to incorporate.
Is this ready for a PR?
@alon-albalak are you working on this? This was one of the two examples of high-priority sources I sent you last week.
I did not yet. I'm currently at ICLR, will get in touch with @nkandpa2 next week to see what still needs to be done!
> There are other collections but the data in these are mostly PDFs. If we have a good way of extracting text from these we can consider the other collections as well.
As I mentioned in our recent call, the "other collections" you mention, @nkandpa2, do in fact have massive plain text files available, but they are hidden beneath a second API layer called granules.
I discovered this API layer while looking at the Congressional Hearings collection, because I was considering transcribing the hearing recordings and wanted to check how automated transcripts differ from the official ones. So let's take the hearings as an example to look at.
Like other govinfo collections, the hearings are divided by the API into packages. You query these packages in your `get_packages` method from the `/published` API endpoint. Each package provides a summary with download links, which you retrieve from the `/packages/{package_id}/summary` endpoint in your `get_file_links` method.
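To make the two endpoints concrete, here is a minimal sketch of how their URLs could be built. It is not the code from the `usgpo` branch: the function names `published_url` and `summary_url` are made up for illustration, and the query parameters (`offsetMark`, `pageSize`, `collection`, `api_key`) are assumptions based on the public govinfo API, so check them against the API documentation before relying on them.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.govinfo.gov"


def published_url(start_date: str, collection: str, api_key: str, page_size: int = 100) -> str:
    """Build a /published URL listing packages issued since start_date.

    The query parameter names are assumptions, not taken from the branch code.
    """
    query = urlencode(
        {
            "offsetMark": "*",  # assumed marker for the first page of results
            "pageSize": page_size,
            "collection": collection,
            "api_key": api_key,
        }
    )
    return f"{API_ROOT}/published/{start_date}?{query}"


def summary_url(package_id: str, api_key: str) -> str:
    """Build the /packages/{package_id}/summary URL for one package."""
    query = urlencode({"api_key": api_key})
    return f"{API_ROOT}/packages/{package_id}/summary?{query}"
```

For example, `summary_url("CHRG-118hhrg52370", "DEMO_KEY")` gives the summary URL for the hearing package discussed below, with `DEMO_KEY` standing in for a real API key.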
However, the download links returned by the package summary endpoint differ between serial and non-serial collections. Here is the JSON response for the package `BILLS-118hconres1eh` from the non-serial Congressional Bills collection (`BILLS`):
{
"originChamber": "HOUSE",
"congress": "118",
"session": "1",
"detailsLink": "https://www.govinfo.gov/app/details/BILLS-118hconres1eh",
"isPrivate": "false",
"title": "Regarding consent to assemble outside the seat of government.",
"branch": "legislative",
"isAppropriation": "false",
"collectionName": "Congressional Bills",
"download": {
"premisLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/premis",
"xmlLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/xml",
"txtLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/htm",
"zipLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/zip",
"modsLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/mods",
"pdfLink": "https://api.govinfo.gov/packages/BILLS-118hconres1eh/pdf"
},
"pages": "4",
"related": {
"billStatusLink": "https://api.govinfo.gov/packages/BILLSTATUS-118hconres1/xml"
},
"relatedLink": "https://api.govinfo.gov/related/BILLS-118hconres1eh",
"suDocClassNumber": "Y 1.6:, Y 1.4/9:",
"dateIssued": "2023-01-09",
"currentChamber": "HOUSE",
"billVersion": "eh",
"billType": "hconres",
"packageId": "BILLS-118hconres1eh",
"collectionCode": "BILLS",
"governmentAuthor2": "House of Representatives",
"governmentAuthor1": "Congress",
"publisher": "U.S. Government Publishing Office",
"docClass": "hconres",
"lastModified": "2024-06-06T19:32:41Z",
"category": "Bills and Statutes",
"billNumber": "1",
"otherIdentifier": {
"migrated-doc-id": "f:hc1_eh.txt",
"parent-ils-system-id": "000501532",
"child-ils-title": "House concurrent resolutions",
"parent-ils-title": "Congressional bills",
"child-ils-system-id": "000325575",
"stock-number": "021-610-00252-9"
}
}
The JSON response above contains a direct `txtLink` download link for the plain text of the package. In contrast, the JSON response below for the package `CHRG-118hhrg52370` from the serial Congressional Hearings collection (`CHRG`) does not contain a `txtLink` download link:
{
"dateIssued": "2023-02-28",
"documentType": "HHRG",
"congress": "118",
"heldDates": [
"2023-02-28"
],
"session": "1",
"packageId": "CHRG-118hhrg52370",
"collectionCode": "CHRG",
"detailsLink": "https://www.govinfo.gov/app/details/CHRG-118hhrg52370",
"title": "UNCERTAINTY, INFLATION, REGULATIONS: CHALLENGES FOR AMERICAN AGRICULTURE",
"branch": "legislative",
"collectionName": "Congressional Hearings",
"download": {
"premisLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/premis",
"zipLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/zip",
"modsLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/mods"
},
"pages": "272",
"governmentAuthor2": "House of Representatives",
"chamber": "HOUSE",
"relatedLink": "https://api.govinfo.gov/related/CHRG-118hhrg52370",
"governmentAuthor1": "Congress",
"publisher": "U.S. Government Publishing Office",
"suDocClassNumber": "Y 4.AG 8/1:118-1",
"docClass": "HHRG",
"lastModified": "2024-06-07T02:18:29Z",
"category": "Congressional Committee Materials",
"otherIdentifier": {
"migrated-doc-id": "f:52370.txt",
"ils-system-id": "001230528"
},
"granulesLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/granules?offsetMark=*&pageSize=100"
}
However, as you can see, the serial package JSON contains an additional key called `granulesLink`. If you request this endpoint, you get the list of granules associated with the package:
{
"count": 1,
"offset": null,
"pageSize": 100,
"nextPage": null,
"previousPage": null,
"granules": [
{
"title": "Uncertainty, Inflation, Regulations: Challenges for American Agriculture",
"granuleId": "CHRG-118hhrg52370",
"granuleLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/granules/CHRG-118hhrg52370/summary",
"granuleClass": null
}
]
}
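Since the granule list is paginated (`pageSize`, `nextPage`), a package with many granules needs more than one request. A minimal sketch of walking all pages could look like the following. `iter_granules` and `fetch_json` are hypothetical names: `fetch_json` stands for whatever function performs the authenticated GET and returns the parsed JSON. The sketch also assumes that `nextPage`, when it is not null, holds the URL of the next page, which the single-page response above does not confirm.

```python
from typing import Callable, Iterator


def iter_granules(granules_link: str, fetch_json: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every granule entry of a package, following nextPage links.

    fetch_json is a hypothetical callable: URL in, parsed JSON dict out.
    """
    url = granules_link
    while url:
        page = fetch_json(url)
        # Each page carries its granule entries under the "granules" key.
        yield from page.get("granules", [])
        # Assumption: nextPage is either null or the URL of the next page.
        url = page.get("nextPage")
```

Passing `fetch_json` in as an argument keeps the pagination logic separate from HTTP details such as the API key and rate limiting.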
Packages from the hearings collection usually contain only one granule per package, with the granule ID identical to the package ID, but other serial collections, like the Federal Register (`FR`), return several dozen granules or more per package, each with its own granule ID. As the granule list JSON response shows, each granule has its own summary link, which returns a JSON response similar to the package summary JSON. This granule summary endpoint is where we can find the plain text download links for serial collections:
...
"packageId": "CHRG-118hhrg52370",
"packageLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/summary",
"committees": [
{
"authorityId": "hsag00",
"chamber": "H",
"committeeName": "Committee on Agriculture",
"type": "S"
}
],
"collectionCode": "CHRG",
"detailsLink": "https://www.govinfo.gov/app/details/CHRG-118hhrg52370/CHRG-118hhrg52370",
"title": "Uncertainty, Inflation, Regulations: Challenges for American Agriculture",
"isAppropriation": "false",
"collectionName": "Congressional Hearings",
"granuleClass": "OTHERPART",
"granuleId": "CHRG-118hhrg52370",
"download": {
"premisLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/premis",
"txtLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/granules/CHRG-118hhrg52370/htm",
"zipLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/zip",
"modsLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/granules/CHRG-118hhrg52370/mods",
"pdfLink": "https://api.govinfo.gov/packages/CHRG-118hhrg52370/granules/CHRG-118hhrg52370/pdf"
},
...
The single hearing transcript for the granule `CHRG-118hhrg52370` contains more than 150K whitespace-separated tokens. Such long-form transcripts seem particularly valuable as a data source, especially because of their multi-speaker nature. The token count for the Federal Register varies more from document to document, but the collection contains more than 92K granules, so I think it is worth adjusting the code for serial collections in general, so that we can get all the text data from the remaining collections. Let me know if you need a hand adjusting the code, @nkandpa2!
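To illustrate the kind of adjustment I have in mind, here is a rough sketch that returns the plain text links for a package whether it is serial or not. It is not a drop-in change to `get_file_links`. The names `text_links`, `fetch_json` and `list_granules` are hypothetical: `fetch_json` stands for a function that GETs a URL and returns parsed JSON, and `list_granules` for one that yields every granule entry behind a `granulesLink`, including any further pages.

```python
from typing import Callable, Iterable, List


def text_links(
    summary: dict,
    fetch_json: Callable[[str], dict],
    list_granules: Callable[[str], Iterable[dict]],
) -> List[str]:
    """Collect txtLink URLs for one package from its summary JSON.

    fetch_json and list_granules are hypothetical helpers passed in by the caller.
    """
    download = summary.get("download", {})
    # Non-serial packages (e.g. BILLS) expose txtLink directly.
    if "txtLink" in download:
        return [download["txtLink"]]

    # Serial packages (e.g. CHRG, FR) only expose granulesLink.
    granules_link = summary.get("granulesLink")
    if not granules_link:
        return []

    links = []
    for granule in list_granules(granules_link):
        # Each granule has its own summary, which carries the txtLink.
        granule_summary = fetch_json(granule["granuleLink"])
        txt = granule_summary.get("download", {}).get("txtLink")
        if txt:
            links.append(txt)
    return links
```

For a hearing this usually yields a single link, while a Federal Register package would yield one link per granule that has text.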
US GPO is the agency responsible for publishing documents authored by the US federal government (which are therefore in the public domain), and it provides an API for accessing these documents and their associated metadata.