Open vchekan opened 9 years ago
The initial idea for implementing error recovery was to have a single point, in the tcp connection object, where all error recovery would be done. But this turned out to be too low level, because depending on what the connection is doing, we might want a different recovery strategy. When RecoveryMonitor tests whether a broker is available, we do not want any recovery and want to fail fast instead. Another issue is that currently only the fetcher and producer are well protected, whereas earlier stages, such as connection, offset resolution, and metadata fetching, are more fragile or have to implement their own recovery, thus polluting the code.
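One way to picture "different strategy depending on what the connection is doing" is to let the caller pass a recovery policy rather than hard-coding retries in the connection. The sketch below is only an illustration in Python, not code from this project: `RecoveryPolicy`, `FAIL_FAST`, `RETRY` and `run_with_policy` are hypothetical names, and the attempt counts are arbitrary.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RecoveryPolicy:
    """Hypothetical policy: how often to try and how long to wait in between."""
    max_attempts: int
    delay_seconds: float = 0.0

    def __post_init__(self) -> None:
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")


# Hypothetical presets. A broker availability probe would use FAIL_FAST,
# while connection setup, offset resolution or metadata fetching could use RETRY.
FAIL_FAST = RecoveryPolicy(max_attempts=1)
RETRY = RecoveryPolicy(max_attempts=3)


def run_with_policy(operation: Callable[[], T], policy: RecoveryPolicy) -> T:
    """Run `operation`, retrying on ConnectionError as the policy allows."""
    last_error: Optional[ConnectionError] = None
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return operation()
        except ConnectionError as error:
            last_error = error
            # Sleep only if another attempt will follow.
            if attempt < policy.max_attempts:
                time.sleep(policy.delay_seconds)
    # max_attempts >= 1 guarantees at least one attempt ran and failed here.
    assert last_error is not None
    raise last_error
```

With this shape the recovery logic lives in one helper, and each stage only picks a policy instead of implementing its own retry loop.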
Fast reaction: consider a partition failed immediately after a tcp error, and not after many retries. Remember to fail all request tasks which are waiting for responses. While waiting for recovery, pay attention to changes in metadata. Work nicely with shutdown and drain logic.
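The first two requirements above (react immediately to a tcp error, and fail every request still waiting for a response) could look roughly like the following. This is a minimal Python sketch under assumptions, not the project's implementation: `PendingRequests` and its callback are hypothetical, requests are keyed by correlation id, and locking is left out for brevity.

```python
from concurrent.futures import Future
from typing import Callable, Dict


class PendingRequests:
    """Hypothetical tracker of in-flight requests for one broker connection."""

    def __init__(self, on_connection_failed: Callable[[Exception], None]) -> None:
        self._pending: Dict[int, Future] = {}
        # Hypothetical hook: e.g. mark the broker's partitions as failed.
        self._on_connection_failed = on_connection_failed

    def register(self, correlation_id: int) -> Future:
        """Remember a sent request and return a future for its response."""
        future: Future = Future()
        self._pending[correlation_id] = future
        return future

    def complete(self, correlation_id: int, response: object) -> None:
        """Deliver a response; unknown or already failed ids are ignored."""
        future = self._pending.pop(correlation_id, None)
        if future is not None:
            future.set_result(response)

    def fail_all(self, error: Exception) -> None:
        """On the first tcp error, notify the owner and fail every waiter."""
        pending, self._pending = self._pending, {}
        # Notify first, so state is marked failed before waiters wake up.
        self._on_connection_failed(error)
        for future in pending.values():
            future.set_exception(error)
```

Swapping the dictionary out before failing the futures means a late response for an already failed request is simply dropped instead of raising.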
Currently error processing is focused on Kafka broker recovery. Connecting to a broker and fetching offsets are not reliable, and failures there are not handled properly, leading to occasional random behavior.