ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Extend URL Receiver to allow different event stores to be used #22

Open anjackson opened 5 years ago

anjackson commented 5 years ago

The KafkaUrlReceiver could be refactored to offer different storage options, e.g.

This would allow the same continuous crawling behaviour to be used without requiring Kafka, which would make it easier for others to experiment with our crawl set-up. But it would significantly increase the integration testing needed, would have no log compression, and we may not use it.

anjackson commented 3 years ago

Might be best to just transition to using Tailer via a TailerListenerAdapter, and make Heritrix rotate the input fc.tocrawl.jsonl files upon checkpointing, so we just re-read the last input list when resuming, as this is desirable behaviour. No need to store offsets etc.
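
A minimal sketch of the "no stored offsets" idea, written in Python as a language-neutral illustration rather than with the Commons IO Tailer itself. The `JsonlTail` class, the `demo` helper and the `url` field are hypothetical names, not part of ukwa-heritrix. The read position lives only in memory, so a fresh reader (standing in for a resumed crawl) simply re-reads the current input file from the start:

```python
import json
import os
import tempfile


class JsonlTail:
    """Follow a JSONL file, keeping the read position in memory only."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # never persisted: a new instance starts from 0 again

    def poll(self):
        """Return records from complete lines appended since the last poll."""
        records = []
        if not os.path.exists(self.path):
            return records
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            while True:
                line = f.readline()
                if not line.endswith(b"\n"):
                    break  # end of file, or a partially written line: retry later
                self.offset = f.tell()
                if line.strip():
                    records.append(json.loads(line))
        return records


def demo():
    """Append two records, then show what a running and a resumed reader see."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "fc.tocrawl.jsonl")
        with open(path, "a") as f:
            f.write(json.dumps({"url": "http://example.org/"}) + "\n")
        tail = JsonlTail(path)
        first = tail.poll()
        with open(path, "a") as f:
            f.write(json.dumps({"url": "http://example.org/about"}) + "\n")
        second = tail.poll()               # only the newly appended record
        resumed = JsonlTail(path).poll()   # a fresh reader sees the whole file
        return first, second, resumed
```

If the file is rotated at each checkpoint, the "whole file" a resumed reader sees is just the input received since that checkpoint, which is why no offset bookkeeping should be needed.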

Add a separate Logstash service with HTTP Input and JSONL Output and we have a launch API.
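
To illustrate the shape of such a launch API, here is a rough stand-in written with the Python standard library. It is not Logstash and does not reproduce the fields Logstash adds to events; `make_handler`, `serve` and the 202 response code are assumptions made for this sketch. Each POSTed JSON body is appended to the input file as one JSONL line:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(jsonl_path):
    """Build a request handler that appends POSTed JSON to jsonl_path."""
    lock = threading.Lock()  # serialise appends within this process

    class LaunchHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                record = json.loads(self.rfile.read(length))
            except ValueError:
                # Body was not valid JSON: reject it and write nothing
                self.send_response(400)
                self.end_headers()
                return
            with lock, open(jsonl_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")  # one record per line
            self.send_response(202)
            self.end_headers()

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return LaunchHandler


def serve(jsonl_path, port=0):
    """Start the endpoint on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), make_handler(jsonl_path))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A client would then POST something like `{"url": "http://example.org/"}` to the port reported by `server.server_address[1]`, and the record would show up as a new line in the file being tailed.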

Being able to do all this using simple log files would make deployment much easier for others, as well as for us. For example, rotation can gzip files to save space, and the CLI can be used to 'pour in' large numbers of URIs with cat lorralorraurls.jsonl >> fc.tocrawl.jsonl etc.
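
The rotate-and-gzip step could look roughly like the following sketch. The `rotate_and_gzip` function and the `<file>.<checkpoint_id>.gz` naming scheme are assumptions for illustration, not how Heritrix names checkpoint files:

```python
import gzip
import os
import shutil


def rotate_and_gzip(path, checkpoint_id):
    """Move the live input file aside, gzip it, and start a fresh empty one."""
    rotated = f"{path}.{checkpoint_id}"
    os.replace(path, rotated)      # atomic rename on the same filesystem
    open(path, "a").close()        # new, empty live file for later appends
    with open(rotated, "rb") as src, gzip.open(rotated + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(rotated)             # keep only the compressed copy
    return rotated + ".gz"
```

One caveat with this approach: a writer that already holds the old file open when the rename happens will keep writing to the rotated file, so a real implementation would need to coordinate rotation with whatever is appending.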