An experiment aimed at building a scalable, modular web archive system based on Docker Compose and, possibly, Apache Storm.
The production version of this approach is held in the Pulse project.
To make any progress, we need to be able to effectively compare any new crawler with our current system. Therefore, we start by reproducing our existing crawl system via Docker Compose, and check we fully understand it before attempting to make any modifications. We will then look at ways of modifying, replacing or removing our current components in order to make the whole system more maintainable, manageable and scalable.
Our goals are:
Freely lifting useful ideas from:
Most of the folders in this repository are distinct Dockerized services. The folders beginning with `compose-` contain `docker-compose.yml` files that assemble these individual services into larger, integrated systems.
Where the services are under active development, the service folder is a git submodule, pulling in the original repository and building it directly inside this parent project. This makes integrated development and testing much easier. However, if you clone this repository, you'll probably want to do so recursively, like this:
$ git clone --recursive git@github.com:anjackson/wren.git
This will pull down all of the submodules at the same time as the original clone.
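If you have already cloned the repository without `--recursive`, running `git submodule update --init --recursive` inside the clone should fetch the submodules in the same way.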
As individual services stabilize, it should be possible to remove these submodules and run the Docker images instead.
Problems:
- `FC-1-uris-to-check`
- `FC-2-uris-to-render`
- `FC-3-uris-to-crawl`
- `FC-4-uris-to-index`
- `FI-1-checkpoints-to-package`
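These queue names suggest a straightforward message-passing pipeline, with each stage consuming from one queue and feeding the next. Purely as a hedged illustration, assuming a RabbitMQ broker on localhost and a simple JSON message body (both assumptions, not a description of the production setup), a component could enqueue a URI for checking like this:

```python
# Hedged sketch: enqueue a URI onto the FC-1-uris-to-check queue.
# Assumes a RabbitMQ broker on localhost and a simple JSON message body;
# the real message format used by the crawl system may differ.
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Durable queues survive broker restarts, which matters for crawl integrity.
channel.queue_declare(queue="FC-1-uris-to-check", durable=True)

message = {"url": "http://example.org/"}
channel.basic_publish(
    exchange="",
    routing_key="FC-1-uris-to-check",
    body=json.dumps(message),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```

A matching consumer would read from the same queue and, presumably, pass the URI on to the next stage (`FC-2-uris-to-render`, and so on).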
warcprox takes `warcprox_meta["warc-prefix"]` from the `Warcprox-Meta` request header, which is supplied as JSON:
Warcprox-Meta: { "warc-prefix": "PREFIX"}
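For example, a client making requests through warcprox can set this header per request. This is a minimal sketch using the Python `requests` library; the proxy address (`localhost:8000`) is an assumption:

```python
# Hedged sketch: route a request through warcprox and set the WARC prefix.
# The proxy address (localhost:8000) is an assumption; adjust to your setup.
import json
import requests

proxies = {"http": "http://localhost:8000", "https": "http://localhost:8000"}
headers = {"Warcprox-Meta": json.dumps({"warc-prefix": "PREFIX"})}

# verify=False would matter for https URLs, since warcprox
# man-in-the-middles TLS with its own CA certificate.
response = requests.get("http://example.org/", proxies=proxies,
                        headers=headers, verify=False)
print(response.status_code)
```

warcprox then records the fetched content into WARC files whose names begin with the given prefix.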
We are evaluating whether Apache Storm provides a useful framework for modularizing and scaling the core crawl process itself. In particular, the way the framework provides guaranteed message processing (e.g. at-least-once semantics) should help ensure the integrity of the system.
Wren includes a prototype replacement for our suite of Python-based scripts that render URLs that are part of a Heritrix crawl in order to determine the URLs of dynamically transcluded dependencies.
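The prototype lives in its own service folder; purely to illustrate the idea, the sketch below uses Selenium with headless Chrome (an assumption, not necessarily what the prototype uses) and approximates the dependency list via the browser's Resource Timing API:

```python
# Hedged sketch: render a page and list the dependency URLs it loaded.
# Uses Selenium + headless Chrome purely for illustration; the actual
# rendering service may use a different browser or harvesting approach.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("http://example.org/")
    # The Resource Timing API lists every sub-resource the page pulled in,
    # including dynamically transcluded ones (scripts, XHR, images, etc.).
    urls = driver.execute_script(
        "return window.performance.getEntriesByType('resource')"
        ".map(function(e) { return e.name; });"
    )
    for url in urls:
        print(url)
finally:
    driver.quit()
```

Driving a real browser is what distinguishes this from Heritrix's own link extraction: it surfaces URLs that only appear once scripts have run.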
We also need to reliably launch our regular crawls. The current system relies on a script (`w3start.py`) that is launched by an hourly cron job. However, if something goes wrong during the launch process, the system cannot retry. A better option is to use the cron job only to place a crawl request on a queue, and have a daemon process watch that queue and launch the script.
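A minimal sketch of that pattern, assuming a RabbitMQ queue and treating the queue name and launch command as placeholders: the cron job publishes a small launch-request message, and a long-running consumer only acknowledges it once the launch succeeds, so a failed launch is redelivered and retried.

```python
# Hedged sketch of the cron-to-queue-to-daemon pattern, assuming RabbitMQ.
# The queue name ("crawl-launch-requests") and launch command are
# illustrative placeholders only.
import subprocess
import pika


def handle_launch_request(channel, method, properties, body):
    # Run the existing launch script; only acknowledge the message if it
    # succeeds, so the broker redelivers it and the launch is retried.
    result = subprocess.run(["python", "w3start.py"])
    if result.returncode == 0:
        channel.basic_ack(delivery_tag=method.delivery_tag)
    else:
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="crawl-launch-requests", durable=True)
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="crawl-launch-requests",
                      on_message_callback=handle_launch_request)
channel.start_consuming()
```

In practice you would add a delay or a dead-letter queue before requeueing, so that a persistent failure does not turn into a tight retry loop.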
One option is to create a normal server daemon process. We've tended to do this in the past, but it has led to various important services being spread over a number of machines, which makes the dependencies difficult to manage and the processing difficult to monitor.
Using Storm would allow us to centralize these daemons and integrate them into our overall monitoring approach. They would also retry robustly and be less dependent on specific hardware systems.