ukwa / ukwa-manage

Shepherding our web archives from crawl to access.
Apache License 2.0

Document Harvester issues #78

Closed anjackson closed 2 years ago

anjackson commented 3 years ago

e.g. Luigi reports the capture as unavailable:

2021-09-03 11:50:28,143 WARNING: Data for wb.AvailableInWayback(url=https://www.camden.gov.uk/documents/20142/149005548/Writing+Key+Stage+2+Voluntary+Example+Minimum+Expectations.pdf/b4e90cb4-e090-d190-c13a-f135758905d5?t=1623434758935, ts=20210822095109, check_available=False, wayback_prefix=http://ingest:9081/wayback, cdxserver_endpoint=http://cdx.api.wa.bl.uk/data-heritrix) does not exist (yet?). The task is an external data dependency, so it cannot be run from this luigi process.

but the same URL and timestamp are present in the CDX index:

uk,gov,camden)/documents/20142/149005548/writing+key+stage+2+voluntary+example+minimum+expectations.pdf/b4e90cb4-e090-d190-c13a-f135758905d5?t=1623434758935 20210822095109 https://www.camden.gov.uk/documents/20142/149005548/Writing+Key+Stage+2+Voluntary+Example+Minimum+Expectations.pdf/b4e90cb4-e090-d190-c13a-f135758905d5?t=1623434758935 application/pdf 200 F3BHENDEC273FMJTH47XB3ZSUMDSO4PI - - 0 141743977 /heritrix/output/frequent-npld/20210617123541/warcs/BL-NPLD-20210822094623431-14815-80~npld-heritrix3-worker-1~8443.warc.gz

anjackson commented 3 years ago

Ah, looks like https://github.com/nla/outbackcdx/issues/12#issuecomment-351932136

anjackson commented 3 years ago

Okay, finally implementing the double-quoting seems to have worked!
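
For anyone hitting the same thing, here is a minimal sketch of why quoting matters for a URL like the one above. It is not the actual ukwa-manage code. It assumes that "double-quoting" means percent-encoding the target URL twice before placing it in the `url` query parameter, and that the lookup path decodes the parameter twice, with the second pass treating `+` as a space. The helper names `cdx_query_url` and `decode_as_server` are hypothetical; only the endpoint and target URL come from the log above.

```python
from urllib.parse import parse_qs, quote, unquote_plus, urlsplit

# Endpoint and target URL copied from the warning above.
CDX_ENDPOINT = "http://cdx.api.wa.bl.uk/data-heritrix"
TARGET = (
    "https://www.camden.gov.uk/documents/20142/149005548/"
    "Writing+Key+Stage+2+Voluntary+Example+Minimum+Expectations.pdf/"
    "b4e90cb4-e090-d190-c13a-f135758905d5?t=1623434758935"
)


def cdx_query_url(endpoint, url, rounds=2):
    """Hypothetical helper: percent-encode `url` `rounds` times."""
    encoded = url
    for _ in range(rounds):
        # safe="" so that '/', ':', '?', '=' and '+' are all encoded.
        encoded = quote(encoded, safe="")
    return f"{endpoint}?url={encoded}"


def decode_as_server(query_url, passes):
    """Hypothetical model of the receiving side (an assumption).

    The first pass is ordinary query-string parsing; any further
    pass is a form-style decode where '+' becomes a space.
    """
    value = parse_qs(urlsplit(query_url).query)["url"][0]
    for _ in range(passes - 1):
        value = unquote_plus(value)
    return value


# No quoting: '+' is read as a space, so the lookup key is wrong.
unquoted = decode_as_server(f"{CDX_ENDPOINT}?url={TARGET}", passes=1)

# Quoted once: survives one decoding pass, but not two.
once_1 = decode_as_server(cdx_query_url(CDX_ENDPOINT, TARGET, 1), passes=1)
once_2 = decode_as_server(cdx_query_url(CDX_ENDPOINT, TARGET, 1), passes=2)

# Quoted twice: survives two decoding passes.
twice_2 = decode_as_server(cdx_query_url(CDX_ENDPOINT, TARGET, 2), passes=2)

print(unquoted == TARGET, once_1 == TARGET, once_2 == TARGET, twice_2 == TARGET)
```

If the real lookup path decodes only once, a single round of quoting should be enough. The second round only helps when something in between (a proxy or the server itself) decodes the parameter again.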