lmullen / cchc

America's Public Bible for Computing Cultural Heritage in the Cloud
Creative Commons Zero v1.0 Universal

LOC.gov pagination limits make it impossible to get all of big collections #22

Open lmullen opened 3 years ago

lmullen commented 3 years ago

The LOC.gov pagination limits make it impossible to page past 100,000 items, so the crawler can't fetch all of Chronicling America.
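To make the limit concrete, here is a rough sketch of the arithmetic. It assumes the cap applies to the item offset, so the last page you can request depends on the page size. The function names and the `item_limit` default are illustrative only and are not taken from the crawler.

```python
def last_reachable_page(per_page: int, item_limit: int = 100_000) -> int:
    """Highest page number whose items all fall within the assumed item limit."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return item_limit // per_page


def page_item_range(page: int, per_page: int) -> tuple[int, int]:
    """1-based positions of the first and last item on a given page."""
    if page <= 0 or per_page <= 0:
        raise ValueError("page and per_page must be positive")
    first = (page - 1) * per_page + 1
    last = page * per_page
    return first, last


if __name__ == "__main__":
    # With 1,000 items per page, page 100 is the last one inside the limit.
    print(last_reachable_page(1000))   # 100
    # Page 101 would start at item 100,001, which is past the limit.
    print(page_item_range(101, 1000))  # (100001, 101000)
```

Under that assumption, a smaller page size does not help: it only spreads the same 100,000 items over more pages.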

A sample log entry from the crawler:

cchc-crawler  | time="2021-08-13T03:47:34Z" level=warning msg="HTTP error when fetching from API" http_code=400 http_error="400 Bad Request" url="https://www.loc.gov/collections/chronicling-america/?at%21=aka%2Cbreadcrumbs%2Cbrowse%2Ccategories%2Ccontent%2Ccontent_is_post%2Cexpert_resources%2Cfacet_trail%2Cfacet_views%2Cfacets%2Cfeatured_items%2Cform_facets%2Clegacy-url%2Cnext%2Cnext_sibling%2Coptions%2Coriginal_formats%2Cpages%2Cpartof%2Cprevious%2Cprevious_sibling%2Cresearch-centers%2Cshards%2Csite_type%2Csubjects%2Ctimeline_1852_1880%2Ctimeline_1881_1900%2Ctimeline_1901_1925%2Ctimestamp%2Ctopics%2Cviews&c=1000&fa=online-format%3Aonline+text&fo=json&sp=101&st=list"

Requesting that URL directly (page 101 at 1,000 items per page, per the `sp` and `c` parameters) does in fact return a 400 error.
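One possible way to avoid sending requests that are bound to fail is to check the `sp` (page) and `c` (items per page) query parameters before fetching. The following is only a sketch: `exceeds_item_limit` is a hypothetical helper, the 100,000 cap is the limit observed in this issue, the fallback page size of 25 is an assumption to verify against the API documentation, and the example URL is an abbreviated form of the one in the log entry.

```python
from urllib.parse import parse_qs, urlsplit


def exceeds_item_limit(url: str, item_limit: int = 100_000,
                       default_per_page: int = 25) -> bool:
    """Return True if the page requested by `url` would end past `item_limit`.

    Assumes `sp` is the 1-based page number and `c` is the page size.
    `default_per_page` is an assumed fallback when `c` is absent.
    """
    query = parse_qs(urlsplit(url).query)
    page = int(query.get("sp", ["1"])[0])
    per_page = int(query.get("c", [str(default_per_page)])[0])
    return page * per_page > item_limit


if __name__ == "__main__":
    # Abbreviated version of the failing URL (the long at! parameter is dropped).
    failing = ("https://www.loc.gov/collections/chronicling-america/"
               "?c=1000&fa=online-format%3Aonline+text&fo=json&sp=101&st=list")
    print(exceeds_item_limit(failing))                             # True
    print(exceeds_item_limit(failing.replace("sp=101", "sp=100")))  # False
```

A check like this only lets the crawler stop cleanly at the boundary; it does not retrieve the items beyond it.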

We probably need to ask whether there is a way around this.

Cf. #18.