usgpo / bulk-data

User Guides for XML on the govinfo Bulk Data Repository. For information about Bill Status XML Bulk Data, see https://github.com/usgpo/bill-status.
https://www.govinfo.gov/bulkdata

Incomplete XML for H.R. 5376 #95

Closed by LonelySpaceman 2 years ago

LonelySpaceman commented 2 years ago

Looking into the XML file for H.R. 5376 of the 117th Congress, the tags that should contain the header data for SEC. 20001(c)(4) are completely missing. This is the only instance of missing formatting data that I've found in the database, so I don't know if this is a larger issue, but it might be worth investigating.

The missing tags should be: <paragraph id="some ID"><enum>(4)</enum><header>Rebuild america's schools subgrants to eligible local educational agencies</header>
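One way to check for an element like this is to walk the paragraphs and report which ones carry a header. The sketch below is only an illustration: the XML fragment is a hypothetical stand-in, not text copied from the bill, and the `id` values and the `paragraph_headers` helper are made up for the example. It also assumes the file has no namespace on these elements, and a paragraph without a header is not necessarily an error.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that mimics the structure described above.
# The id values and body text are placeholders, not real bill content.
SAMPLE = """<subsection id="HYPOTHETICAL-C"><enum>(c)</enum>
<paragraph id="HYPOTHETICAL-3"><enum>(3)</enum><text>Body text only.</text></paragraph>
<paragraph id="HYPOTHETICAL-4"><enum>(4)</enum><header>Rebuild america's schools subgrants to eligible local educational agencies</header><text>Body text.</text></paragraph>
</subsection>"""


def paragraph_headers(xml_text):
    """Map each paragraph's <enum> text to its <header> text (None if absent)."""
    root = ET.fromstring(xml_text)
    result = {}
    for paragraph in root.iter("paragraph"):
        enum = paragraph.findtext("enum", default="")
        # findtext returns None when the paragraph has no direct <header> child.
        result[enum] = paragraph.findtext("header")
    return result


headers = paragraph_headers(SAMPLE)
for enum, header in headers.items():
    print(enum, "->", header if header is not None else "(no header element)")
```

Running the same function over the downloaded file contents (instead of `SAMPLE`) makes it easy to tell a genuinely absent element from a parsing bug on the client side.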

llaplant commented 2 years ago

Hi, which version of the bill are you accessing? For example, the version below is the RH (Reported in House) version.

https://www.govinfo.gov/content/pkg/BILLS-117hr5376rh/xml/BILLS-117hr5376rh.xml

LonelySpaceman commented 2 years ago

I'm also looking at the RH version of the bill. I just realized I'm using the FDSys API to retrieve the XML data because it automatically retrieves the most recent version of the bill. This might be a question more up the alley of the folks behind FDSys rather than y'all. If you still want to look into it, the API link I'm retrieving the XML Data from is:

https://api.fdsys.gov/link?collection=bills&billtype=hr&billnum=5376&congress=117&link-type=xml

Thanks!

jonquandt commented 2 years ago

@LonelySpaceman --

Hi -- govinfo replaced the FDsys site a few years ago, and the FDsys link service links you're using have been redirected to their govinfo equivalents. The same team that developed the FDsys site is operating the govinfo site.

Thank you for pointing out the potential issue with the data. We'll take a look at that.

I have two potential suggestions for accessing govinfo data programmatically.

govinfo Link Service

Here's documentation for the govinfo link service (see also the link-service repo).

The equivalent for the link you have above is: https://www.govinfo.gov/link/bills/117/hr/5376?link-type=xml

As you said, this will return the most recent version of the bill by default.

If you want to specify the version, you can add a billversion parameter for a given version. e.g. https://www.govinfo.gov/link/bills/117/hr/5376?link-type=xml&billversion=rh

In this case, there's only one bill version available for this particular bill: https://www.govinfo.gov/app/details/BILLS-117hr5376rh/related
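As a rough sketch, the link service URLs above can be assembled like this. The function name `bill_link_url` and its argument names are hypothetical. Only the URL shape and the `link-type` and `billversion` parameters come from the examples in this comment.

```python
from urllib.parse import urlencode

LINK_SERVICE_BASE = "https://www.govinfo.gov/link/bills"


def bill_link_url(congress, bill_type, bill_number, link_type="xml", bill_version=None):
    """Build a govinfo link service URL for a bill.

    Without bill_version, the service returns the most recent version.
    """
    params = {"link-type": link_type}
    if bill_version is not None:
        params["billversion"] = bill_version
    return f"{LINK_SERVICE_BASE}/{congress}/{bill_type}/{bill_number}?{urlencode(params)}"


latest_url = bill_link_url(117, "hr", 5376)
rh_url = bill_link_url(117, "hr", 5376, bill_version="rh")
print(latest_url)
print(rh_url)
```

Pinning `bill_version` is useful when results need to be reproducible, since the default target changes whenever a newer version of the bill is published.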

govinfo API

The other suggestion I would have is to take a look at the govinfo API (see also its GitHub repo).

For example, you can get a complete list of BILLS added or updated within a specified time range using our collections service - e.g. https://api.govinfo.gov/collections/BILLS/2022-01-01T00:00:00Z?offset=0&pageSize=100&api_key=DEMO_KEY

There are several parameters that allow filtering by congress, billversion, etc.

This is helpful for getting the newest content from the system. It returns a list of BILLS packages along with a link to a json summary that provides links to content and metadata, including the XML.
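A minimal sketch of building that collections request follows. The helper name `collections_url` is hypothetical, and only the `offset`, `pageSize`, and `api_key` parameters shown in the example URL are used. The additional filters mentioned above are left out because their exact names are not given here.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

API_BASE = "https://api.govinfo.gov"


def collections_url(collection, start, api_key="DEMO_KEY", offset=0, page_size=100):
    """Build a collections service URL for packages added or updated since `start`."""
    # The example URL uses a UTC timestamp with a trailing "Z".
    timestamp = start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({"offset": offset, "pageSize": page_size, "api_key": api_key})
    return f"{API_BASE}/collections/{collection}/{timestamp}?{query}"


url = collections_url("BILLS", datetime(2022, 1, 1, tzinfo=timezone.utc))
print(url)
```

Passing a timezone-aware `datetime` avoids ambiguity about which zone the start time is in. A naive value would be interpreted as local time by `astimezone`.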

Example summary for the particular bill you referenced above: https://api.govinfo.gov/packages/BILLS-117hr5376rh/summary?api_key=DEMO_KEY

The collections service returns results based on the lastModified date/time in the summary json above.

That summary json also provides links to the associated BILLSTATUS xml as well as related documents within our system, including other bill versions, associated public and private laws, Presidential Signing Statements, Congressional Committee Prints, Committee Reports, and STATUTE and USCODE references, as they become available within govinfo.
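For illustration, the package ID and the related URLs in this thread follow a visible pattern that can be sketched as below. The helper names are hypothetical, and the ID format is inferred only from the `BILLS-117hr5376rh` example, so treat it as an assumption rather than a documented rule.

```python
def bill_package_id(congress, bill_type, bill_number, bill_version):
    """Compose a BILLS package ID, following the BILLS-117hr5376rh example."""
    return f"BILLS-{congress}{bill_type}{bill_number}{bill_version}"


def summary_url(package_id, api_key="DEMO_KEY"):
    """URL of the json summary for a package on the govinfo API."""
    return f"https://api.govinfo.gov/packages/{package_id}/summary?api_key={api_key}"


def content_xml_url(package_id):
    """Direct URL of the package's XML, matching the content link quoted earlier."""
    return f"https://www.govinfo.gov/content/pkg/{package_id}/xml/{package_id}.xml"


package_id = bill_package_id(117, "hr", 5376, "rh")
print(summary_url(package_id))
print(content_xml_url(package_id))
```

In practice it is probably safer to follow the links returned inside the summary json than to construct content URLs by hand, since the summary is where the content and metadata links are published.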

We also have a published endpoint that lets you look up packages by their dateIssued field.

We're working on expanding relationships that are available via the API and making additional enhancements available based on community feedback. Feel free to add or comment on existing issues.

LonelySpaceman commented 2 years ago

This is awesome guys, thanks so much for responding so quickly. How far back does the govInfo xml database go? I know that FDsys only went back to the 113th congress.

LonelySpaceman commented 2 years ago

Also, I switched over my program to using the govInfo link service to retrieve the xml and it's working perfectly, but I did find another missing header in the same bill. The header for Section 70807(c) "imposition of fee" is also missing.

jonquandt commented 2 years ago

The govinfo bulkdata repository can show you the coverage of the different collections in bulk xml.

We have XML versions of Congressional Bill text from the 113th Congress on, including versions of the bill text in USLM -- for more information on that format, see the uslm repo.

We also have plain text of the BILLS going back to the 103rd Congress (1993).

Here's the collection browse page that can give you a little more info. The associated help page is also useful.

llaplant commented 2 years ago

@LonelySpaceman is this the header? (screenshot)

It is also in the PDF at the bottom of page 60. (screenshot)

https://www.govinfo.gov/content/pkg/BILLS-117hr5376rh/pdf/BILLS-117hr5376rh.pdf

LonelySpaceman commented 2 years ago

Yeah, I double-checked the xml and both of the tags I thought were missing are actually there. It's a problem with my code, not the database. So sorry y'all, thanks for all of your help!