Rationale
The redis writers' connectivity appears unreliable.
Changes
raise the redis writer supervisor's restart intensity to 15000 restarts within 30 seconds
add housekeeping in logplex_queue so that its list of redis writers stays up to date
don't let the writer exit normally on unexpected responses from a logplex_queue process
Details
The previous restart intensity in the redis writer supervisor configuration allowed 1000 restarts per second. This is a problem when many redis writers restart at once, for example with 100 shards and 10 writers per shard, which is 1000 writers. During a network connectivity problem the restart limit is easily exceeded, which forces a restart of the redis writer supervisor. After a supervisor restart, all previously created connection information is lost, and it is not recovered without manual intervention.
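The arithmetic behind this can be sketched with a small model. The Python below is not logplex code; it is a simplified simulation of the OTP rule that a supervisor gives up once more than `intensity` restarts occur within `period` seconds. The function name `supervisor_survives` and the failure schedule (every writer crashing once per wave, three waves 0.5 seconds apart) are assumptions chosen for illustration only.

```python
from collections import deque


def supervisor_survives(restart_times, intensity, period):
    """Return True if a supervisor with the given restart intensity
    would stay up for the given sequence of child restart timestamps.

    Simplified model of the OTP rule: the supervisor terminates once
    more than `intensity` restarts fall within the last `period` seconds.
    """
    window = deque()
    for t in sorted(restart_times):
        window.append(t)
        # Drop restarts that have fallen out of the sliding window.
        while window and window[0] <= t - period:
            window.popleft()
        if len(window) > intensity:
            return False
    return True


# 100 shards x 10 writers per shard, as in the example above.
writers = 100 * 10

# Hypothetical outage: every writer crashes once per wave,
# three waves spaced 0.5 seconds apart.
restarts = [wave * 0.5 for wave in range(3) for _ in range(writers)]

old_ok = supervisor_survives(restarts, intensity=1000, period=1)
new_ok = supervisor_survives(restarts, intensity=15000, period=30)
print(old_ok, new_ok)
```

Under these assumptions the old limit is exhausted by the first wave and exceeded by the second, while the new limit leaves room for several more waves within its 30 second window.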
The logplex_queue processes keep a list of workers for bookkeeping. This list serves no purpose other than introspection. Without this change the list becomes outdated as writer connections to redis go away.
On unexpected errors when fetching messages from its logplex_queue process, a redis writer process would exit normally, which prevents an automatic restart by the supervisor.
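Why a normal exit can suppress the restart is easier to see with a model of OTP's child restart types. The sketch below is Python, not logplex code. It assumes the writers are supervised as `transient` children, which the description implies but does not state. The function name `supervisor_restarts` and the example exit reason `("error", "unexpected_response")` are hypothetical.

```python
def supervisor_restarts(restart_type, exit_reason):
    """Model of the OTP rules deciding whether a terminated child
    is restarted by its supervisor.

    permanent: always restarted
    temporary: never restarted
    transient: restarted only on abnormal termination, i.e. any exit
               reason other than normal, shutdown, or (shutdown, Term)
    """
    if restart_type == "permanent":
        return True
    if restart_type == "temporary":
        return False
    if restart_type == "transient":
        is_shutdown_tuple = (
            isinstance(exit_reason, tuple) and exit_reason[:1] == ("shutdown",)
        )
        clean_exit = exit_reason in ("normal", "shutdown") or is_shutdown_tuple
        return not clean_exit
    raise ValueError(f"unknown restart type: {restart_type!r}")


# Old behaviour: the writer exits normally on an unexpected response.
before = supervisor_restarts("transient", "normal")

# New behaviour: the writer exits with an error reason (hypothetical value).
after = supervisor_restarts("transient", ("error", "unexpected_response"))
print(before, after)
```

In this model, only the abnormal exit reason leads to a restart, which matches the intent of the change.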