When one crawler froze up and we switched to another, we ran into trouble. Both crawlers used the same networked Gluster filesystem (used for Kafka and Prometheus), whereas the crawl state (frontier and caches) was held locally on each machine. This mismatch caused problems for Kafka and Prometheus on startup.
This ticket is to consider how to handle this:
Only have crawl output on Gluster?
Move Kafka/Prometheus/etc. onto local disk?
Make Kafka a distinct, fully distributed service? (Similar to how the crawl-time CDX is a separate service.)
Also, improve the documentation to cover crawler failover.