Open anjackson opened 5 years ago
Might be best to just transition to using Tailer (from Apache Commons IO) via a TailerListenerAdapter, and make Heritrix rotate the input fc.tocrawl.jsonl files upon checkpointing, so we just re-read the last input list when resuming, which is the behaviour we want. No need to store offsets etc.
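A rough sketch of the rotate-on-checkpoint idea, to show why no offsets are needed. This is plain Python standing in for the behaviour, not the Commons IO Tailer API and not Heritrix code; the function names, the `url` field and the `.cp00001` suffix are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def read_crawl_requests(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line, always from the top of the file."""
    if not path.exists():
        return
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def rotate_on_checkpoint(path: Path, checkpoint_id: str) -> Path:
    """Move the current input file aside and start a fresh, empty one.

    Everything in the rotated file is assumed to be covered by the checkpoint,
    so a resumed crawl only needs to re-read the new (post-rotation) file.
    """
    rotated = path.with_name(f"{path.name}.{checkpoint_id}")
    if path.exists():
        path.rename(rotated)
    path.touch()
    return rotated


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        todo = Path(tmp) / "fc.tocrawl.jsonl"
        # Hypothetical record shape: a single "url" field per line.
        todo.write_text('{"url": "http://example.org/a"}\n', encoding="utf-8")
        rotate_on_checkpoint(todo, "cp00001")
        with todo.open("a", encoding="utf-8") as handle:
            handle.write('{"url": "http://example.org/b"}\n')
        # On resume, only requests written after the checkpoint are re-read.
        print([req["url"] for req in read_crawl_requests(todo)])
```

The trade-off is that anything written since the last checkpoint gets read again on resume, which is fine as long as re-submitting a URI is harmless.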
Add a separate Logstash service with an HTTP input and a JSONL output, and we have a launch API.
Being able to do all this using simple log files would make deployment much easier for others, as well as for us. e.g. rotation can gzip files to save space. We could also use the CLI to 'pour in' large numbers of URIs, e.g. cat lorralorraurls.jsonl >> fc.tocrawl.jsonl etc.
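For scripted use, the same 'pour in' step might look something like the sketch below. It only assumes that the input is one JSON object per line; the `append_urls` helper and the `url` field name are hypothetical, as the actual record schema is not specified here.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def append_urls(urls: Iterable[str], target: Path) -> int:
    """Append one JSON line per URL to the target file and return how many were written."""
    count = 0
    # Append mode, so existing requests are kept and a tailing reader sees new lines.
    with target.open("a", encoding="utf-8") as handle:
        for url in urls:
            # Hypothetical record shape: a single "url" field.
            handle.write(json.dumps({"url": url}) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        todo = Path(tmp) / "fc.tocrawl.jsonl"
        written = append_urls(["http://example.org/", "http://example.com/"], todo)
        print(written, "requests appended to", todo.name)
```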
The KafkaUrlReceiver could be refactored to offer different storage options, e.g.
This would allow the same continuous crawling behaviour to be used without requiring Kafka, which would make it easier for others to experiment with our crawl set-up. But it would significantly increase the integration testing needed, would have no log compression, and we might not use it ourselves.