An experiment aimed at building a scalable, modular web archive system based on Docker Compose and, possibly, Apache Storm.
The production version of this approach is held in the Pulse project.
To make any progress, we need to be able to effectively compare any new crawler with our current system. Therefore, we start by reproducing our existing crawl system via Docker Compose, and check we fully understand it before attempting to make any modifications. We will then look at ways of modifying, replacing or removing our current components in order to make the whole system more maintainable, manageable and scalable.
Our goals are:
Freely lifting useful ideas from:
Most of the folders in this repository are distinct Dockerized services. The folders beginning with `compose-` contain `docker-compose.yml` files that assemble these individual services into larger, integrated systems.
Where the services are under active development, the service folder is a git submodule, pulling in the original repository and building it directly inside this parent project. This makes integrated development and testing much easier. However, if you clone this repository, you'll probably want to do so recursively, like this:
$ git clone --recursive git@github.com:anjackson/wren.git
This will pull down all of the submodules at the same time as the original clone.
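If you have already cloned the repository without `--recursive`, running `git submodule update --init --recursive` inside the clone should fetch the submodules in the same way.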
As individual services stabilize, it should be possible to remove these submodules and run the Docker images instead.
Problems:
- `FC-1-uris-to-check`
- `FC-2-uris-to-render`
- `FC-3-uris-to-crawl`
- `FC-4-uris-to-index`
- `FI-1-checkpoints-to-package`
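These queue names suggest a straightforward message-passing pipeline, with each stage consuming from one queue and feeding the next. Purely as a hedged illustration, assuming a RabbitMQ broker on localhost and a simple JSON message body (both assumptions, not a description of the production setup), a component could enqueue a URI for checking like this:

```python
# Hedged sketch: enqueue a URI onto the FC-1-uris-to-check queue.
# Assumes a RabbitMQ broker on localhost and a simple JSON message body;
# the real message format used by the crawl system may differ.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Durable queues survive broker restarts, which matters for crawl integrity.
channel.queue_declare(queue="FC-1-uris-to-check", durable=True)

message = {"url": "http://example.org/"}
channel.basic_publish(
    exchange="",
    routing_key="FC-1-uris-to-check",
    body=json.dumps(message),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```

A matching consumer would read from the same queue and, presumably, pass the URI on to the next stage (`FC-2-uris-to-render`, and so on).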
warcprox takes `warcprox_meta["warc-prefix"]` from the `Warcprox-Meta` request header, which is supplied as JSON:
Warcprox-Meta: { "warc-prefix": "PREFIX"}
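For example, a client making requests through warcprox can set this header per request. This is a minimal sketch using the Python `requests` library; the proxy address (`localhost:8000`) is an assumption:

```python
# Hedged sketch: route a request through warcprox and set the WARC prefix.
# The proxy address (localhost:8000) is an assumption; adjust to your setup.
import json
import requests

proxies = {"http": "http://localhost:8000", "https": "http://localhost:8000"}
headers = {"Warcprox-Meta": json.dumps({"warc-prefix": "PREFIX"})}

# verify=False would matter for https URLs, since warcprox
# man-in-the-middles TLS with its own CA certificate.
response = requests.get("http://example.org/", proxies=proxies,
                        headers=headers, verify=False)
print(response.status_code)
```

warcprox then records the fetched content into WARC files whose names begin with the given prefix.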
We are evaluating whether Apache Storm provides a useful framework for modularizing and scaling the core crawl process itself. In particular, the way the framework provides guaranteed message processing (e.g. at-least-once semantics) should help ensure the integrity of the system.
Wren includes a prototype replacement for our suite of Python-based scripts that render URLs that are part of a Heritrix crawl in order to determine the URLs of dynamically transcluded dependencies.
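The prototype lives in its own service folder; purely to illustrate the idea, the sketch below uses Selenium with headless Chrome (an assumption, not necessarily what the prototype uses) and approximates the dependency list via the browser's Resource Timing API:

```python
# Hedged sketch: render a page and list the dependency URLs it loaded.
# Uses Selenium + headless Chrome purely for illustration; the actual
# rendering service may use a different browser or harvesting approach.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("http://example.org/")
    # The Resource Timing API lists every sub-resource the page pulled in,
    # including dynamically transcluded ones (scripts, XHR, images, etc.).
    urls = driver.execute_script(
        "return window.performance.getEntriesByType('resource')"
        ".map(function(e) { return e.name; });"
    )
    for url in urls:
        print(url)
finally:
    driver.quit()
```

Driving a real browser is what distinguishes this from Heritrix's own link extraction: it surfaces URLs that only appear once scripts have run.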
We also need to reliably launch our regular crawls. The current system relies on a script (`w3start.py`) that is launched by an hourly cron job. However, if something goes wrong during the launch process, the system cannot retry. A better option is to use the cron job only to place a crawl request on a queue, and have a daemon process watch that queue and launch the script.
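A minimal sketch of that pattern, assuming a RabbitMQ queue and treating the queue name and launch command as placeholders: the cron job publishes a small launch-request message, and a long-running consumer only acknowledges it once the launch succeeds, so a failed launch is redelivered and retried.

```python
# Hedged sketch of the cron-to-queue-to-daemon pattern, assuming RabbitMQ.
# The queue name ("crawl-launch-requests") and launch command are
# illustrative placeholders only.
import subprocess
import pika


def handle_launch_request(channel, method, properties, body):
    # Run the existing launch script; only acknowledge the message if it
    # succeeds, so the broker redelivers it and the launch is retried.
    result = subprocess.run(["python", "w3start.py"])
    if result.returncode == 0:
        channel.basic_ack(delivery_tag=method.delivery_tag)
    else:
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="crawl-launch-requests", durable=True)
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="crawl-launch-requests",
                      on_message_callback=handle_launch_request)
channel.start_consuming()
```

In practice you would add a delay or a dead-letter queue before requeueing, so that a persistent failure does not turn into a tight retry loop.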
One option is to create a normal server daemon process. We've tended to do this in the past, but it has led to various important services being spread over a number of machines, which makes the dependencies difficult to manage and the processing difficult to monitor.
Using Storm would allow us to centralize these daemons and integrate them into our overall monitoring approach. They would also retry robustly and be less dependent on specific hardware systems.