
Wren

An experiment aimed at building a scalable, modular web archive system based on Docker Compose and, possibly, Apache Storm.

The production version of this approach is held in the Pulse project.

To make any progress, we need to be able to effectively compare any new crawler with our current system. Therefore, we start by reproducing our existing crawl system via Docker Compose and checking that we fully understand it before attempting any modifications. We will then look at ways of modifying, replacing or removing our current components in order to make the whole system more maintainable, manageable and scalable.

Our goals are:

Freely lifting useful ideas from:

Folder structure

Most of the folders in this repository are distinct Dockerized services. The folders beginning with compose- contain docker-compose.yml files that assemble these individual services into larger, integrated systems.

Where the services are under active development, the service folder is a git submodule, pulling in the original repository and building it directly inside this parent project. This makes integrated development and testing much easier. However, if you clone this repository, you'll probably want to do so recursively, like this:

$ git clone --recursive git@github.com:anjackson/wren.git

This pulls down all of the submodules at the same time as the original clone.
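If you have already cloned the repository without the --recursive flag, the standard git command for fetching the submodules after the fact is:

$ git submodule update --init --recursive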

As individual services stabilize, it should be possible to remove these submodules and run the Docker images instead.

Problems:

Queue-based Harvest Workflow

FC-1-uris-to-check
FC-2-uris-to-render
FC-3-uris-to-crawl
FC-4-uris-to-index

FI-1-checkpoints-to-package
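As a rough illustration of how a worker might sit between these stage queues, here is a minimal sketch assuming a RabbitMQ broker (via the pika client) and JSON messages of the form {"url": ...}; the broker choice, message format, and routing rule are all assumptions, not the project's actual design:

import json
import pika

QUEUES = ["FC-1-uris-to-check", "FC-2-uris-to-render",
          "FC-3-uris-to-crawl", "FC-4-uris-to-index"]

def on_message(channel, method, properties, body):
    uri = json.loads(body)["url"]
    # Hypothetical routing: pages go to the renderer, everything else
    # goes straight to the crawl queue.
    if uri.endswith((".html", "/")):
        next_queue = "FC-2-uris-to-render"
    else:
        next_queue = "FC-3-uris-to-crawl"
    channel.basic_publish(exchange="", routing_key=next_queue, body=body)
    # Ack only after the hand-off succeeds, so a failed message is redelivered.
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
for q in QUEUES:
    channel.queue_declare(queue=q, durable=True)
channel.basic_consume(queue="FC-1-uris-to-check", on_message_callback=on_message)
channel.start_consuming()

The key property is that a message is only acknowledged once it has been handed to the next queue, so a crashed worker cannot silently drop a URI.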

warcprox

warcprox reads warcprox_meta["warc-prefix"] from the Warcprox-Meta request header, which is supplied as JSON:

Warcprox-Meta: { "warc-prefix": "PREFIX"}
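For example, a client can set the prefix on a per-request basis by sending the request through warcprox with that header; in this sketch the proxy address (warcprox is commonly run on its default port, 8000) and the prefix value are assumptions:

import json
import requests

# Route the request through a locally running warcprox instance.
proxies = {"http": "http://localhost:8000", "https": "http://localhost:8000"}
# The prefix value here is just an example; it controls the naming of the
# WARC files that the captured records are written to.
headers = {"Warcprox-Meta": json.dumps({"warc-prefix": "example-crawl"})}

# warcprox man-in-the-middles HTTPS with its own CA certificate, so
# certificate verification is disabled in this sketch.
response = requests.get("http://example.org/", proxies=proxies,
                        headers=headers, verify=False)
print(response.status_code)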

Wren Storm Topologies

We are evaluating whether Apache Storm provides a useful framework for modularizing and scaling the core crawl process itself. In particular, the way the framework provides guaranteed message processing (e.g. at-least-once semantics) should help ensure the integrity of the system.
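As a flavour of what this looks like in code, here is a minimal bolt written with the streamparse Python library; the bolt itself and its fetch step are illustrative assumptions, but the ack/fail pattern is how Storm's at-least-once guarantee is obtained:

from streamparse import Bolt

class FetchBolt(Bolt):
    # Take manual control of acking so that a failed fetch is replayed.
    auto_ack = False
    auto_fail = False

    def process(self, tup):
        url = tup.values[0]  # assumed tuple layout: a single URL field
        try:
            self.fetch(url)   # hypothetical downstream work
            self.emit([url])  # pass the URL on to the next bolt
            self.ack(tup)     # ack only once the work has succeeded
        except Exception:
            self.fail(tup)    # Storm replays the tuple from the spout

    def fetch(self, url):
        raise NotImplementedError("placeholder for the real fetch step")

If the process dies before the ack is sent, Storm times the tuple out and replays it, which is exactly the at-least-once behaviour we want for crawl integrity.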

Elastic Web Rendering

Wren includes a prototype replacement for our suite of Python-based scripts that render URLs that are part of a Heritrix crawl in order to determine the URLs of dynamically transcluded dependencies.
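The underlying idea can be sketched in a few lines with a modern headless-browser library; Playwright is not what the existing scripts use, so this is purely illustrative of the technique:

from playwright.sync_api import sync_playwright

def find_transcluded_urls(url):
    """Render a page and return the URLs it requested while loading."""
    requested = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Every dynamically transcluded dependency shows up as a request.
        page.on("request", lambda request: requested.append(request.url))
        page.goto(url)
        browser.close()
    return requested

print(find_transcluded_urls("http://example.org/"))

The URLs gathered this way can then be fed back to Heritrix as candidate crawl targets.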

Robust Crawl Launching

We also need to reliably launch our regular crawls. The current system relies on a script (w3start.py) that is launched by an hourly cron job. However, if something goes wrong during the launch process, the system cannot retry. A better option is to use the cron job only to place a crawl request on a queue, and to use a daemon process to watch that queue and launch the script, as sketched below.

One option is to create a normal server daemon process. We've tended to do this in the past, but it has led to important services being spread across a number of machines, which makes their dependencies difficult to manage and their processing difficult to monitor.

Using Storm would allow us to centralize these daemons and integrate them into our overall monitoring approach. They would also retry robustly and be less dependent on specific hardware systems.
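As a concrete sketch of the queue-based hand-off itself, the code below shows both halves of the arrangement, assuming a RabbitMQ queue consumed via pika; the queue name and message format are assumptions, and the daemon simply shells out to the existing script:

import json
import subprocess
import pika

QUEUE = "crawl-launch-requests"  # hypothetical queue name

def enqueue_launch_request(job_name):
    # Cron side: drop a persistent message on the queue and exit.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_publish(
        exchange="", routing_key=QUEUE, body=json.dumps({"job": job_name}),
        properties=pika.BasicProperties(delivery_mode=2))
    connection.close()

def on_launch_request(channel, method, properties, body):
    # Daemon side: run the launch script, and only ack on success, so that
    # a failed launch is redelivered and retried.
    job = json.loads(body)["job"]
    print("Launching crawl for job:", job)
    result = subprocess.run(["python", "w3start.py"])
    if result.returncode == 0:
        channel.basic_ack(delivery_tag=method.delivery_tag)
    else:
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

def run_daemon():
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_consume(queue=QUEUE, on_message_callback=on_launch_request)
    channel.start_consuming()

A Storm topology could play the same role as run_daemon() here, with the added benefit of supervision and centralized monitoring.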

CDX/Remote Resource Index Servers

Remote Browsers

End-to-End Testing