usgpo / bulk-data

User Guides for XML on the govinfo Bulk Data Repository. For information about Bill Status XML Bulk Data, see https://github.com/usgpo/bill-status.
https://www.govinfo.gov/bulkdata

Entire Section text version #171

Open TimTCM opened 2 weeks ago

TimTCM commented 2 weeks ago

Even with the current Microcomp format, would GPO be willing to publish a text file of the Congressional Record's four sections in their entirety?

Entire sections are available in PDF, but not in text.

One can create a combined version from the downloaded zip file, and I am doing so right now, but confidence in the product would be greater if the official source made this available.
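For context, here is a minimal sketch of that combining step in Python. It rests on assumptions not stated in this thread: that the daily zip holds one `.htm` file per granule, and that the file names carry a page prefix such as `-PgH` or `-PgS` that identifies the section. `combine_section`, `html_to_text`, and `natural_key` are hypothetical helper names, not part of any govinfo tooling.

```python
import re
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Includes any <title> text as well; filter it out if unwanted.
        self.parts.append(data)


def html_to_text(markup):
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def natural_key(name):
    # re.split with a capture group alternates text and digit runs,
    # so odd positions are always numbers: "Pg2" sorts before "Pg10".
    tokens = re.split(r"(\d+)", name)
    return [int(tok) if i % 2 else tok.lower() for i, tok in enumerate(tokens)]


def combine_section(zip_source, section_marker):
    """Join the text of every .htm member whose name contains section_marker.

    ASSUMPTION: section_marker (for example "-PgS") matches the naming
    scheme inside the zip; adjust it to the actual member names.
    """
    with zipfile.ZipFile(zip_source) as archive:
        names = sorted(
            (n for n in archive.namelist()
             if n.lower().endswith(".htm") and section_marker in n),
            key=natural_key,
        )
        chunks = []
        for name in names:
            raw = archive.read(name).decode("utf-8", errors="replace")
            chunks.append(html_to_text(raw).strip())
    return "\n\n".join(chunks)
```

Calling something like `combine_section("daily-issue.zip", "-PgS")` (a placeholder file name) would return one string for the matching granules in page order. This only concatenates text; it does not verify that the result matches the official PDF of the section.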

I don't plan to store the Record indefinitely, so if content later needs re-caching, I'd like to be able to pull it from the official source without reprocessing the whole zip file on the fly each time.

This would be especially helpful for things that combine pages like House Morning Hour debate, one-minute speeches, and then even more so in the Senate where a Senate speaker's remarks can cross multiple pages as they are currently divided.

Thank you

jonquandt commented 2 weeks ago

@TimTCM - thanks for the suggestion. We'll look at the feasibility of this.

From an acceptance criteria point of view, would a single HTML/text file for the complete package meet the need? This would include the text for the entire daily issue.

For this:

> This would be especially helpful for things that combine pages like House Morning Hour debate, one-minute speeches, and then even more so in the Senate where a Senate speaker's remarks can cross multiple pages as they are currently divided.

Could you provide an example where a single Senate speaker's remarks are split across multiple granules? That will help us understand that portion a bit better.

If they are speaking on different subjects, it makes sense to me that they would have separate granules in GovInfo, but perhaps there's a scenario that I'm not thinking of at the moment.