usgpo / api

services to access govinfo content and metadata
https://api.govinfo.gov

API request to access all congressional reports by the particular committee #155

Open zymbuzz opened 2 months ago

zymbuzz commented 2 months ago

Hi, thanks a lot for maintaining and developing the API.

I am currently reading the documentation on accessing resources via the API, but I need help implementing one particular request with the current API functionality. Specifically, I would like to access all the metadata, as well as the text files, of congressional reports by the House Committee on Ways and Means.

My first approach was to link services, but I could not filter by committee. Alternatively, I could rely solely on a search via the API, where I could select all the documents from the committee. However, I wonder if the search is too reliable. The last possibility is to rely on the Congress API, which has more flexibility but, to my understanding, covers fewer sources.

I would appreciate your guidance on how you would access all congressional reports by the committee via the API.

jonquandt commented 2 months ago

I would recommend using our search service. I'm not sure what you mean by "However, I wonder if the search is too reliable."

If our parsing has identified a report as from a particular committee, doing a search service request will return it.

Here is a curl command that should return the results you are after:

curl -X 'POST' \
  'https://api.govinfo.gov/search' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": "collection:crpt committee:(ways and means)",
  "pageSize": 10,
  "offsetMark": "*",
  "sorts": [
    {
      "field": "relevancy",
      "sortOrder": "DESC"
    }
  ],
  "historical": true,
  "resultLevel": "default"
}'

It will return a set of results that look like this:

{
  "results": [
    {
      "title": "EXTENDING LIMITS OF U.S. CUSTOMS WATERS ACT",
      "packageId": "CRPT-118hrpt436",
      "granuleId": "CRPT-118hrpt436-pt2",
      "lastModified": "2024-04-08T03:50:59Z",
      "governmentAuthor": [
        "Congress",
        "House of Representatives"
      ],
      "dateIssued": "2024-04-02",
      "collectionCode": "CRPT",
      "resultLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436-pt2/summary",
      "dateIngested": "2024-04-07",
      "download": {
        "premisLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/premis",
        "txtLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436-pt2/htm",
        "zipLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/zip",
        "modsLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436-pt2/mods",
        "pdfLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436-pt2/pdf"
      },
      "relatedLink": null
    },
    {
      "title": "EXTENDING LIMITS OF U.S. CUSTOMS WATERS ACT",
      "packageId": "CRPT-118hrpt436",
      "granuleId": "CRPT-118hrpt436",
      "lastModified": "2024-04-08T03:50:59Z",
      "governmentAuthor": [
        "Congress",
        "House of Representatives"
      ],
      "dateIssued": "2024-04-02",
      "collectionCode": "CRPT",
      "resultLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436/summary",
      "dateIngested": "2024-04-07",
      "download": {
        "premisLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/premis",
        "txtLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436/htm",
        "zipLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/zip",
        "modsLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436/mods",
        "pdfLink": "https://api.govinfo.gov/packages/CRPT-118hrpt436/granules/CRPT-118hrpt436/pdf"
      },
      "relatedLink": null
    },
....
  ],
  "offsetMark": "AoJw4JCJiY0DPwBDUlBULTExOGhycHQzNDctQ1JQVC0xMThocnB0MzQ3",
  "count": 785
}

As you can see, the results include direct links to the different download options.

To get the next set of results, submit the same search request with offsetMark set to the value returned in the previous response.
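The paging described here can be sketched in Python. This is a minimal sketch, not an official client: the names `build_payload`, `post_json` and `iter_results` are made up for illustration, and the stopping rule (an empty page, or an offsetMark that no longer changes) is an assumption rather than documented behaviour. Like the curl command above, it sends no API key; if your requests need one, add it as the API documentation describes.

```python
import json
import urllib.request

SEARCH_URL = "https://api.govinfo.gov/search"


def build_payload(query, offset_mark="*", page_size=10):
    """Build the same request body as the curl example, with a given offsetMark."""
    return {
        "query": query,
        "pageSize": page_size,
        "offsetMark": offset_mark,
        "sorts": [{"field": "relevancy", "sortOrder": "DESC"}],
        "historical": True,
        "resultLevel": "default",
    }


def post_json(url, payload):
    """POST a JSON body and decode the JSON response (standard library only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_results(query, post=post_json, url=SEARCH_URL, page_size=10):
    """Yield every result, feeding each response's offsetMark into the next request."""
    offset_mark = "*"  # "*" asks for the first page, as in the curl example
    while True:
        page = post(url, build_payload(query, offset_mark, page_size))
        results = page.get("results") or []
        yield from results
        next_mark = page.get("offsetMark")
        # Assumed end-of-results signals: an empty page or a mark that stops advancing.
        if not results or not next_mark or next_mark == offset_mark:
            break
        offset_mark = next_mark
```

Calling `list(iter_results("collection:crpt committee:(ways and means)"))` would then collect every page of the search. The `post` argument is there only so the loop can be exercised with a stand-in function instead of a live request.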

This is the equivalent search in the GovInfo UI

Note that you could also use hswm00 in the committee parameter, as that is the committee's authority ID.

zymbuzz commented 2 months ago

Thanks a lot for the explanation.

Regarding the "unreliability", I was unsure whether I understood exactly what the search does and how much I could rely on it over the long term. To my understanding, the search is currently in beta. Most importantly, I was unsure whether the search filters across the whole universe of documents or looks for something close to the query.

To illustrate the second point, in your equivalent search in the govinfo UI, on the left-hand side, one can refine the search by selecting collections to be either congressional reports or congressional serial sets. I would have expected the search request to exclude "congressional serial sets".

Could you also explain where I can specify the committee parameter? Also, where can I find the mnemonics for other committees?

jonquandt commented 2 months ago

The beta label indicates that the Search service is still an early release. Functionally, it is production quality, but the interface may still change. We'll be removing the label in the near future, likely in our June release. More information about the search service can be found in this overview article.

There are some Congressional Serial Set documents that are also coded as Congressional Reports. For example, H. Rept. 94-1266 is a Serial Set package that is a Congressional Report.

For more information on document types available within the Serial Set, see https://www.govinfo.gov/help/serial-set#types on the Congressional Serial Set help page.

The collection help pages list field operators and parameters that you can use to make a search more specific. Here is a list of specific metadata values for the Congressional Reports collection.

In this case, committee searches against the congCommittee element in MODS, which allows searching by name or authorityId.

In the future, we may consider developing a more comprehensive, cross-collection list of parameters for reference.
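As a small illustration of where the committee parameter goes, the sketch below builds the query string from the curl example for either form of the value. The function name `committee_query` is hypothetical, and wrapping multi-word names in parentheses simply follows the pattern of the curl example rather than any documented rule.

```python
def committee_query(collection, committee):
    """Build a search query limited to one collection and one committee.

    `committee` may be a committee name or an authority ID such as hswm00.
    """
    committee = committee.strip()
    # Group multi-word names in parentheses, as the curl example does.
    term = f"({committee})" if " " in committee else committee
    return f"collection:{collection} committee:{term}"
```

Here `committee_query("crpt", "ways and means")` reproduces the query string used in the curl example, and `committee_query("crpt", "hswm00")` gives the authority-ID form.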

zymbuzz commented 2 months ago

Thanks a lot for getting back to me. I managed to set up the search following your instructions.

I plan to import all the documents associated with the congressional reports mentioned above. Is using the API the right way to do this, or is it better to rely on the bulk data import?

Another question concerns the time span covered. I noticed that the data is only available from 1995, with some selected documents from earlier periods, so I wonder whether I should rely on the Congress API to get the earlier documents. Does GovInfo use the Congress API to import information? Are there any noticeable differences between the two APIs?

Thanks again for your help.

jonquandt commented 2 months ago

Yes, the API is the appropriate path for this. You can either grab contents from the search service download links directly or go via the resultLink to get additional information about the documents. Note that the zipLink will contain all content and metadata files for the entire package.
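One practical detail, visible in the sample response earlier in the thread, is that a package can appear several times in the results (once per granule) with the same zipLink each time. The sketch below, with hypothetical helper names `package_zip_links` and `granule_links`, shows one way to pick out the links from a list of results so that each package zip is fetched only once. It only selects URLs; the download itself, and any API key it may need, is left out.

```python
def package_zip_links(results):
    """Map each packageId to its zipLink, skipping repeats from multiple granules."""
    links = {}
    for result in results:
        package_id = result.get("packageId")
        zip_link = (result.get("download") or {}).get("zipLink")
        if package_id and zip_link and package_id not in links:
            links[package_id] = zip_link
    return links


def granule_links(results, kind="txtLink"):
    """Return (granuleId, link) pairs for one download type, e.g. txtLink or pdfLink."""
    pairs = []
    for result in results:
        link = (result.get("download") or {}).get(kind)
        if link:
            pairs.append((result.get("granuleId"), link))
    return pairs
```

Applied to the two sample results above, `package_zip_links` would return a single entry for CRPT-118hrpt436, while `granule_links` would return one text link per granule.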

Congress.gov imports Congressional Reports from GovInfo using the GovInfo API, so you shouldn't find any Congressional Reports there that you don't see on GovInfo. GovInfo uses the Congress.gov API to create the bulkdata BILLSUM and BILLSTATUS XML and to retrieve some authority information for committees and individual Members. That authority information is used as part of GovInfo parsing to provide richer metadata for search and access purposes.

Generally speaking, most of the congressional content (Record, Calendars, Bill text, Law text, congressional documents) available on Congress.gov is pulled from the GovInfo API. There are certainly implementation differences between the two, but a high-level difference would be the focus. Congress.gov has a focus on legislative materials primarily for the needs of Congress, while GovInfo acts as a preservation and access repository for official Government publications from all three branches. In addition to legislative publications, GovInfo publishes executive publications, like the Federal Register, Code of Federal Regulations, daily Compilation of Presidential Documents, and other executive agency publications. GovInfo also publishes opinions from the Administrative Office of the U.S. Courts, covering a large number of federal district, bankruptcy, and appellate courts. You may want to see what's available on our help pages for additional examples.

The scope of available Congressional Reports is based on Public Law 103-40, which expanded GPO's mission to provide electronic access to Federal Government information.

We continue to make new reports available and GPO also works to increase the historical scope of a number of collections via digitization of physical publications.

zymbuzz commented 1 month ago

Thanks a lot for your answer. It clarified a lot.

Could you suggest whom I could contact to find some historical documents? I am interested in the historical reports by the Ways and Means Committee. Some historical documents from some committees are readily available via their websites.