The continuous crawling approach developed so far has, sadly, proven unstable. Instead, we will fall back on the existing 'pulse' crawling approach while we work out how best to proceed.
The 'pulse' approach is a compromise: it delivers stable crawls, but works around the inability of H3 to easily support the large number of separate crawls that our curators define in W3ACT. Instead, each set of Targets is grouped by frequency, and each frequency launches at the same point in a regular cycle. For example, the daily crawl is stopped and re-launched every day at 9am, so we only pick up new seeds at (roughly) that time, and only ever crawl one day deep. However, it does give stable and predictable job 'chunks' that H3 can cope with.
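To make the cycle concrete, here is a minimal sketch of how a pulse could be driven by Celery beat (assuming a recent Celery); the task name, module layout and broker URL are hypothetical, not the actual production configuration:

```python
# A sketch only: 'tasks.restart_daily_crawl' and the broker URL are
# hypothetical; the real stop/launch logic lives in the H3 control code.
from celery import Celery
from celery.schedules import crontab

app = Celery('pulse', broker='amqp://localhost')

app.conf.beat_schedule = {
    # Stop and re-launch the daily crawl every day at 9am.
    'relaunch-daily-crawl': {
        'task': 'tasks.restart_daily_crawl',
        'schedule': crontab(hour=9, minute=0),
        'args': ('daily',),
    },
}

@app.task(name='tasks.restart_daily_crawl')
def restart_daily_crawl(frequency):
    """Stop the current job for this frequency, gather the current set of
    Targets for it from W3ACT, and launch a fresh one-day job."""
    pass  # e.g. stop_job(frequency); launch_job(frequency) -- hypothetical helpers
```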
The old code had become muddled by the split between the stable system and the document-harvesting system. The two have now been merged, but the result needs updating and testing, with the document harvester approach supported appropriately.
To Do
- [x] Port `w3start.py` into Celery.
- [x] Include BL Heritrix modules, build against LBS H3.
- [x] Add the `movetohdfs.py` daemon to the Docker test system.
- [x] Ensure all WARCs (ordinary, viral and image) get copied to HDFS.
- [x] Build the `validate job` stage, scanning `crawl.log`, checking for WARCs in HDFS and uploading logs and job files to HDFS (see the validation sketch after this list).
- [x] Port `build sip` to Celery.
- [x] Port `submit sip` to Celery.
- [x] Add `w3act-info.json` to the SIP.
- [x] Modify `movetohdfs.py` etc. to leave behind a SHA-512 hash file, which the workflow can pick up to check and pass down the line (see the sidecar sketch after this list).
- [x] Port the document-harvester logic to H3-LBS-UKWA and Celery (based on CDX or `crawl.log` rather than a crawl feed?).
- [x] Write a `watched-surts.txt` file alongside `surts.txt`.
- [x] Backport the dev `AMQPIndexableCrawlLogFeed` for CDX and documents.
- [x] Port `uristocdxserver` and `docstow3act` to Celery (message interop?).
- [ ] Switch indexing and document extraction to a post-assembly process, rather than the current H3 `AMQPIndexableCrawlLogFeed` modules? This is because of the need to:
    - [ ] Ensure the CDX server contains the full path of each WARC file (can we do this using the current logic?).
    - [ ] Ensure content is available in Wayback.
- [x] Define production queues per task rather than using one queue (see the routing sketch after this list).
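As referenced in the `validate job` item above, a minimal sketch of the validation stage, assuming the WebHDFS-based `hdfs` client library; the namenode URL, user, paths and HDFS layout are all hypothetical:

```python
# A sketch of the 'validate job' stage: scan crawl.log, confirm every WARC
# is on HDFS, then upload the logs. Layout and names are assumptions.
import os
from hdfs import InsecureClient

def validate_job(job_dir, hdfs_prefix, namenode_url='http://namenode:50070'):
    client = InsecureClient(namenode_url, user='heritrix')
    # Scan the crawl log (counting lines here as a stand-in for the real
    # per-line checks).
    crawl_log = os.path.join(job_dir, 'logs', 'crawl.log')
    with open(crawl_log, errors='ignore') as f:
        entries = sum(1 for _ in f)
    print("crawl.log has %d entries" % entries)
    # Every local WARC (ordinary, viral and image) must already be on HDFS.
    for root, _, files in os.walk(job_dir):
        for name in files:
            if name.endswith('.warc.gz'):
                hdfs_path = os.path.join(hdfs_prefix, name)
                if client.status(hdfs_path, strict=False) is None:
                    raise Exception("WARC missing from HDFS: %s" % hdfs_path)
    # Finally, upload the logs and job files themselves:
    client.upload(os.path.join(hdfs_prefix, 'logs'),
                  os.path.join(job_dir, 'logs'))
```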
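For the SHA-512 hand-off, a sketch of the sidecar file `movetohdfs.py` could leave behind for later stages to pick up; the `.sha512` naming convention is an assumption:

```python
# Write '<file>.sha512' next to the original, in the usual
# '<hexdigest>  <path>' format, so later stages can re-verify it.
import hashlib

def write_sha512_sidecar(path, chunk_size=1024 * 1024):
    h = hashlib.sha512()
    with open(path, 'rb') as f:
        # Hash in chunks so large WARCs do not get read into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    sidecar = path + '.sha512'
    with open(sidecar, 'w') as f:
        f.write('%s  %s\n' % (h.hexdigest(), path))
    return sidecar
```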
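And for per-task queues, a sketch of Celery task routing (Celery 4+ setting names); the queue and task names are illustrative, not the production configuration:

```python
from celery import Celery

app = Celery('shepherd', broker='amqp://localhost')  # broker URL is illustrative

# Route each task to its own production queue instead of one shared queue.
app.conf.task_routes = {
    'tasks.validate_job': {'queue': 'validate'},
    'tasks.build_sip':    {'queue': 'sips'},
    'tasks.submit_sip':   {'queue': 'sips'},
    'tasks.move_to_hdfs': {'queue': 'hdfs'},
}

# Workers are then pinned to the queues they should serve, e.g.:
#   celery -A shepherd worker -Q validate
#   celery -A shepherd worker -Q sips,hdfs
```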
Other ideas
- [ ] Add more sanity checks to `validate job`, e.g. check that logs are not empty and that there are no additional WARC files, i.e. WARCs not mentioned in the logs (sketched after this list).
- [ ] Store ARKs and hashes in the launch folder and in the ZIP. See `CrawlJobOutput`.
- [ ] Create a `validate sip` task to inspect the store for content and verify it.
- [ ] Add a test W3ACT to the Docker system, populated appropriately and set up to crawl a Dockerized test site (the acid-crawl idea).
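One of the extra sanity checks above, sketched: flag WARCs on disk that no log line refers to. This assumes our log lines carry WARC filenames, and the directory layout is hypothetical:

```python
import os
import re

def find_unreferenced_warcs(job_dir):
    """Return WARC files on disk that are not mentioned in any log file."""
    referenced = set()
    logs_dir = os.path.join(job_dir, 'logs')
    for name in os.listdir(logs_dir):
        with open(os.path.join(logs_dir, name), errors='ignore') as f:
            for line in f:
                # Pull out any '*.warc.gz' tokens and keep just the filename.
                referenced.update(os.path.basename(m)
                                  for m in re.findall(r'\S+\.warc\.gz', line))
    warc_dir = os.path.join(job_dir, 'warcs')
    on_disk = [n for n in os.listdir(warc_dir) if n.endswith('.warc.gz')]
    return [n for n in on_disk if n not in referenced]
```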