Open marcusb opened 9 years ago
It looks like the "Exception: null" happened when one of the JMX calls failed (for mundane reasons).
We tweaked our approach to this. We agreed that ERROR should mean nothing other than "unrecoverable error", so we no longer set the repair run to that state unless the failure is a known unrecoverable one (a repair segment mismatch with the cluster topology is the only known case for now). The run now keeps retrying when it is hit by exceptions that we don't handle anywhere.
Hopefully that doesn't become a problem in and of itself. At least it's better than retrying when we already know that it's not going to work.
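For anyone following along, here is a rough sketch of the policy described above. It is illustrative Python, not the actual Reaper code: `RunState`, `UnrecoverableRepairError`, `run_segment_with_retry` and the 30-second default delay are all made-up names and values for this example. The idea is that only a known unrecoverable failure moves the run to ERROR, and any other exception leads to a retry after a delay.

```python
import time
from enum import Enum


class RunState(Enum):
    RUNNING = "RUNNING"
    DONE = "DONE"
    ERROR = "ERROR"


class UnrecoverableRepairError(Exception):
    """Hypothetical marker for failures that retrying cannot fix,
    e.g. repair segments that no longer match the cluster topology."""


def run_segment_with_retry(repair_segment, retry_delay_seconds=30.0, sleep=time.sleep):
    """Call repair_segment() until it succeeds or fails unrecoverably.

    Returns a (final_state, attempts) tuple. The sleep function is
    injectable so the delay can be skipped in tests.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            repair_segment()
            return RunState.DONE, attempts
        except UnrecoverableRepairError:
            # Known unrecoverable case: the only path that ends in ERROR.
            return RunState.ERROR, attempts
        except Exception:
            # Anything else (e.g. a failed JMX call while a node restarts)
            # is treated as transient: wait, then try again.
            sleep(retry_delay_seconds)
```

Note that this sketch retries forever on unhandled exceptions, which is the trade-off mentioned above; a cap on attempts or a growing delay could be added if that turns out to be a problem.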
The repair often ends up in ERROR state if nodes are down or restarted. Sometimes the message is "Exception: null". After this happens, the repair must be resumed manually with spreaper. It would be preferable if it resumed automatically, perhaps after some delay.