usgpo / api

services to access govinfo content and metadata
https://api.govinfo.gov

Invalid html in BillStatus documents #86

Closed evan-benoit closed 1 year ago

evan-benoit commented 3 years ago

Hello GovInfo! About 2% of the BILLSTATUS documents that we examine have faulty HTML in the `<billSummaries>` section. For example:

https://www.govinfo.gov/bulkdata/BILLSTATUS/117/s/BILLSTATUS-117s294.xml

The `<billSummaries>` section has unclosed `<p>` tags. Some of the `<p>` tags have a corresponding `</p>` tag, but others do not. Any idea why this is? Can anything be done about it?
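To make the check concrete, here is a minimal sketch of one way to flag summaries whose `<p>` and `</p>` counts differ. It assumes the summary HTML is stored as character data (CDATA or escaped text) inside `<text>` elements under `<billSummaries>`; that layout and the helper names `unmatched_p_tags` and `summaries_with_unmatched_p` are illustrative assumptions, not part of any govinfo tooling.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class PTagCounter(HTMLParser):
    """Counts explicit <p> start tags and </p> end tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.opened = 0
        self.closed = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == "p":
            self.opened += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self.closed += 1


def unmatched_p_tags(html_fragment):
    """Return (number of <p>) minus (number of </p>); 0 means balanced."""
    counter = PTagCounter()
    counter.feed(html_fragment)
    counter.close()
    return counter.opened - counter.closed


def summaries_with_unmatched_p(xml_text):
    """Return indexes of summary <text> elements with unbalanced <p> tags.

    Assumption: each summary's HTML lives in a <text> element somewhere
    under <billSummaries>, as CDATA or escaped text.
    """
    root = ET.fromstring(xml_text)
    flagged = []
    for section in root.iter("billSummaries"):
        for index, text_el in enumerate(section.iter("text")):
            if unmatched_p_tags(text_el.text or "") != 0:
                flagged.append(index)
    return flagged
```

Counting tags is a deliberately crude test: it will not catch a fragment where the counts match but the nesting is wrong, and HTML itself permits omitting `</p>`, so a nonzero result only signals that the fragment is not safe to treat as well-formed XML.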

Thanks! -Evan

jonquandt commented 3 years ago

@evan-benoit - updated your comment to include code fencing around the tags.

I'm looking into this. If you can provide a few additional example IDs, that will help me to investigate. My initial thinking is that this is in the source data.

evan-benoit commented 3 years ago

Sure, here are a few other examples, all with unmatched `<p>` tags.

I'm finding this problem in about 2% of the BILLSTATUS documents.

jonquandt commented 3 years ago

Thank you -- the team that helps supply this is aware of the issue and working to address it by replacing a legacy system. I don't know the exact timeline for this to be completed.

evan-benoit commented 3 years ago

Thanks, I appreciate the speedy response!

jonquandt commented 1 year ago

As an update, this is still in progress upstream of us. This is being tracked by the Library of Congress here: https://github.com/LibraryOfCongress/api.congress.gov/issues/2

I am closing the issue here because it will be resolved upstream, after which we will update our BILLSTATUS and BILLSUM files.