usgpo / bulk-data

User Guides for XML on the govinfo Bulk Data Repository. For information about Bill Status XML Bulk Data, see https://github.com/usgpo/bill-status.
https://www.govinfo.gov/bulkdata

US Code - Nodes in a bulk #25

Open Yelrado opened 6 years ago

Yelrado commented 6 years ago

Is it possible to have https://www.govinfo.gov/wssearch/rb//uscode/2016/ and all its children as a bulk?

Sorry if there's a link to this but I'm unable to find it.

Thank you in advance

jonquandt commented 6 years ago

@Yelrado

Not currently. Note: We don't recommend using the /wssearch context in this manner, as we are likely to make changes that would break anything you build against it.

We are working on some services to allow for programmatic retrieval of content and metadata across the site. More information will be forthcoming soon and we'll be notifying folks via our developers page and relevant github repos.

In the meantime, you could use our sitemaps to find all of the USCODE packages from 2016 and then download the zip packages, which contain all of the content and metadata files (MODS for description and PREMIS for preservation) associated with each package.

https://www.govinfo.gov/sitemaps
https://www.govinfo.gov/sitemap/USCODE_sitemap_index.xml

The 2016 sitemap lists the USCODE 2016 packages in the system, along with links to their content detail pages: https://www.govinfo.gov/sitemap/USCODE_2016_sitemap.xml

examples:

<url>
<loc>
https://www.govinfo.gov/app/details/USCODE-2016-title1
</loc>
<lastmod>2017-07-31T17:09:13.764Z</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>
https://www.govinfo.gov/app/details/USCODE-2016-title2
</loc>
<lastmod>2017-07-31T17:10:51.986Z</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>

Given the package information above, you could either follow the links and grab the associated zip links, or build the zip links directly: in each URL, replace 'app/details' with 'content/pkg' and append .zip to the end.

e.g. https://www.govinfo.gov/content/pkg/USCODE-2016-title1.zip https://www.govinfo.gov/content/pkg/USCODE-2016-title2.zip
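The find-and-replace step above could be scripted along these lines. This is only a sketch: the function name `zip_urls_from_sitemap` and the `SAMPLE` string are made up for illustration, and the code assumes the sitemap has already been downloaded and is available as text. It matches `<loc>` elements by local name, so it should work whether or not the live sitemap declares an XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import List


def zip_urls_from_sitemap(sitemap_xml: str) -> List[str]:
    """Turn each 'app/details' link found in a sitemap into a package zip URL."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for element in root.iter():
        # Compare on the local tag name so namespaced and plain
        # <loc> elements are both matched.
        if element.tag.rsplit("}", 1)[-1] != "loc" or not element.text:
            continue
        details_url = element.text.strip()
        if "/app/details/" not in details_url:
            continue
        # Same substitution as described in the thread:
        # 'app/details' -> 'content/pkg', then append '.zip'.
        urls.append(details_url.replace("/app/details/", "/content/pkg/") + ".zip")
    return urls


# Hypothetical sample, shaped like the sitemap entries quoted above.
SAMPLE = """<urlset>
<url>
<loc>
https://www.govinfo.gov/app/details/USCODE-2016-title1
</loc>
</url>
<url>
<loc>
https://www.govinfo.gov/app/details/USCODE-2016-title2
</loc>
</url>
</urlset>"""

for url in zip_urls_from_sitemap(SAMPLE):
    print(url)
```

Downloading the sitemap and the resulting zip files is left out on purpose; any HTTP client would do for that part.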

Other resources

The Office of the Law Revision Counsel is the source for this data, so you may want to look at their site as well to see if that might meet your needs: http://uscode.house.gov/

Take a look at the following:
http://uscode.house.gov/download/download.shtml
http://uscode.house.gov/download/annualhistoricalarchives/annualhistoricalarchives.htm - this allows you to download prior versions of the USCODE in PDF, locator, and XHTML formats.
http://uscode.house.gov/download/annualhistoricalarchives/pdf/2016/index.html - 2016 USCODE PDF download pages

Yelrado commented 6 years ago

Thank you, @jonquandt for such a great response.

I've reviewed the links, including the other sources, but I still find it more useful to work with /wssearch, since it's all in JSON with the relevant information arranged in a tree.

I completely understand the risks and for now I will stick to it.

I wish you could release a bulk download of all this JSON, though. It would make it a lot easier to work with.

Thanks again for your time.

jonquandt commented 6 years ago

@Yelrado - no problem. Could you expand on "release a bulk of all this json"? I want to get a better sense of what your use case is. Is it that you want a json that has package metadata with a set of links to the content or do you only want bulk access to the content? Is the type of information that's in our mods.xml files what you're looking for, but in a different format?

Yelrado commented 6 years ago

The main difference between the JSON and mods.xml is how the information is structured.

It's very useful for my purpose to have it structured like a tree.

The mods.xml lists all the elements without any structure.

Having built-in tools to import JSON in practically any language is an additional advantage.

To answer your question: I do have access to the content through the zips you're providing. What I need is a JSON that has all the metadata (links to the content) at a fine-grained level, like /wssearch already provides, but available as a single bulk download instead of having to download the nodes one by one.

I hope I explained myself better this time.