
UKWA Heritrix

This repository takes Heritrix3 and adds in code and configuration specific to the UK Web Archive. It is used to build a Docker image that is used to run our crawls.

Local Development

If you are modifying the Java code and want to compile it and run the unit tests, you can use:

$ mvn clean install

However, as the crawler is a multi-component system, you'll also want to run integration tests.

Continuous Integration Testing

All tags, pushes and pull-requests on the main ukwa-heritrix repository will run integration testing before pushing an updated Docker container image. See the workflows here.

However, it is recommended that you understand and run the integration tests locally first.

Local Integration Testing

The supplied Docker Compose file can be used for local testing. This looks quite complex because the system spins up many services, including ones that are only needed for testing:

Docker Compose ensemble visualisation

IMPORTANT: there is a .env file that docker-compose.yml uses to pick up shared variables. These include the user UID that is used to run the services, which should be overridden with the UID you develop under, e.g.

$ export CRAWL_UID=$(id -u)

There's a little helper script to do this, which you can run like this before running Docker operations:

$ source source-setup-crawl-uid.sh

To run the tests locally, build the images:

$ docker-compose build

This builds the heritrix and robot images.

Note that the Compose file is set up to pass the HTTP_PROXY and HTTPS_PROXY environment variables through to the build environment, so as long as those are set, the images should build behind a corporate web proxy. If you are not behind a proxy and these variables are unset, docker-compose will warn that they are not set, but the build should still work.
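For example, assuming a corporate proxy at http://proxy.example.com:3128 (a placeholder address, not a real endpoint), you would export the variables before building:

```shell
# Placeholder proxy address: substitute your site's real proxy URL.
export HTTP_PROXY="http://proxy.example.com:3128"
export HTTPS_PROXY="http://proxy.example.com:3128"
```

With these set, docker-compose build will pass them through to the image build automatically.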

To run the integration tests:

$ docker-compose up

Alternatively, to launch the crawler for manual testing, list the services explicitly (naming heritrix, warcprox and webrender ensures we see the logs from those three containers):

$ docker-compose up heritrix warcprox webrender

and use a second terminal to, for example, launch crawls. Note that ukwa-heritrix is configured to wait a few seconds before auto-launching the frequent crawl job.

After running tests, it's recommended to run:

$ docker-compose rm -f
$ mvn clean

This deletes all the crawl output and state files, thus ensuring that subsequent runs start from a clean slate.

Service Endpoints

Once running, these are the most useful services for experimenting with the crawler itself:

Heritrix: https://localhost:8443/ (username/password heritrix/heritrix). The main Heritrix crawler control interface.
Kafka UI: http://localhost:9000/. A browser UI that lets you look at the Kafka topics.
Crawl CDX: http://localhost:9090/. An instance of OutbackCDX used to record crawl outcomes for analysis and deduplication. Can be used to look up what happened to a URL during the crawl.
Wayback: http://localhost:8080/. An instance of OpenWayback that allows you to play back the pages that have been crawled. Uses the Crawl CDX to look up which WARCs hold the required URLs.

Note that the Heritrix REST API documentation contains some useful examples of how to interact with Heritrix using curl.
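As a sketch of what such interactions look like, here is a status query and a pause/unpause cycle. The job name "frequent" is an assumption based on the auto-launched frequent crawl job mentioned above; the heritrix/heritrix credentials are those of the local stack.

```shell
# Job name "frequent" is an assumption; adjust to match your crawl job.
JOB_URL="https://localhost:8443/engine/job/frequent"

# -k accepts Heritrix's self-signed certificate; --anyauth negotiates
# the digest authentication Heritrix uses. Requires the stack to be running.
curl -s -k -u heritrix:heritrix --anyauth "$JOB_URL"

# POSTing an action controls the job, e.g. pausing and unpausing the crawl:
curl -s -k -u heritrix:heritrix --anyauth -d "action=pause" "$JOB_URL"
curl -s -k -u heritrix:heritrix --anyauth -d "action=unpause" "$JOB_URL"
```

See the Heritrix REST API documentation for the full set of supported actions (build, launch, pause, unpause, checkpoint, terminate, teardown).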

There are a lot of other services, but these are largely intended for checking or debugging:

Heritrix (JMX): localhost:9101. Java JMX service used to access internal state for monitoring the Kafka client. (DEPRECATED)
Heritrix (Prometheus): http://localhost:9119/. Crawler bean used to collect crawler metrics and publish them for Prometheus.
More TBA.

Manual testing

The separate crawl-streams utilities can be used to interact with the logs/streams that feed URLs into the crawl, and document the URLs found and processed by the crawler. To start crawling the two test sites, we use:

$ docker run --net host ukwa/crawl-streams submit -k localhost:9092 fc.tocrawl -S http://acid.matkelly.com/
$ docker run --net host ukwa/crawl-streams submit -k localhost:9092 fc.tocrawl -S http://crawl-test-site.webarchive.org.uk/

Note that the --net host part means the Docker container can talk to your development machine directly as localhost, which is the easiest way to reach your Kafka instance.

The other thing to note is the -S flag: this indicates that these URLs are seeds, which means that when the crawler picks them up, it will widen the scope of the crawl to include any URLs that are on those sites (strictly, those URLs that have the seed URL as a prefix when expressed in SURT form). Without the -S flag, submitted URLs will be ignored unless they are within the current crawler scope.

Note, however, that some extra URLs may be discovered during processing that are necessary for in-scope URLs to work (e.g. images, CSS, JavaScript etc.). The crawler is configured to fetch these even if they are outside the main crawl scope. That is, the crawl scope is intended to match up with the HTML pages that are of interest, and any further resources required by those pages will be added if the crawler determines they are needed.
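To illustrate what a SURT prefix looks like, here is a rough sketch (not the crawler's actual implementation) that approximates the SURT form of a seed by reversing the host's dot-separated labels:

```shell
# Rough SURT-form sketch: reverse the host labels and comma-join them.
# Heritrix's real SURT handling also deals with ports, paths, etc.
url="http://crawl-test-site.webarchive.org.uk/"
host="${url#*://}"    # strip the scheme
host="${host%%/*}"    # strip any path
surt="http://($(echo "$host" | tr '.' '\n' | tac | paste -sd, -),)/"
echo "$surt"    # http://(uk,org,webarchive,crawl-test-site,)/
```

Any URL whose SURT form starts with this prefix, i.e. any URL on that host, falls within the scope added by that seed.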

Directly interacting with Kafka

It's also possible to interact directly with Kafka by installing and using the standard Kafka tools. This is not recommended at present, but these instructions are left here in case they are helpful:

cat testdata/seed.json | kafka-console-producer --broker-list kafka:9092 --topic fc.tocrawl
kafka-console-consumer --bootstrap-server kafka:9092 --topic fc.tocrawl --from-beginning
kafka-console-consumer --bootstrap-server kafka:9092 --topic fc.crawled --from-beginning

Automated testing

The robot container runs test crawls over the two test sites mentioned in the previous section. The actions and expected results are in the crawl-test-site.robot test specification.

Crawl Configuration

We use Heritrix3 Sheets as a configuration mechanism to allow the crawler behaviour to change based on URL SURT prefix.

Summary of Heritrix3 Modules

Modules for Heritrix 3.4.+

Release Process

We only need tagged builds, so

mvn release:clean release:prepare

is sufficient to tag a version and initiate a Docker container build. Note that the SCM/git tag should be of the form X.Y.Z.

Redis Notes

Some experimental code uses a Redis back end. In principle this should support multiple Redis-compatible implementations, but there are subtleties around transactions, distribution, and command syntax.

For example, KvRocks is great but does not support options like ZADD with the LT flag. The LT option was only added recently (Redis 6.2), so it does not have wide support elsewhere. Consider using two operations instead.
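The two-operation fallback can be sketched as follows; the score comparison is shown as a shell function, with the redis-cli calls indicated in comments (key and member names are placeholders):

```shell
# Hypothetical sketch of the two-op fallback for `ZADD key LT score member`
# on back-ends without the LT flag: read the current score, then only
# write when the new score is lower (or the member is absent).
should_update () {
  # usage: should_update NEW_SCORE CURRENT_SCORE
  # CURRENT_SCORE is empty when ZSCORE returned nil (member absent).
  # Integer scores only, to keep the sketch simple.
  new="$1"; current="$2"
  [ -z "$current" ] || [ "$new" -lt "$current" ]
}

# With redis-cli, the pair would look like (names are placeholders):
#   current=$(redis-cli ZSCORE mykey member)
#   should_update "$new" "$current" && redis-cli ZADD mykey "$new" member
#
# Unlike a single ZADD ... LT, this read-then-write pair is not atomic:
# wrap it in MULTI/EXEC or a Lua script if there are concurrent writers.
```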

Changes