spotify / cassandra-reaper

Software to run automated repairs of cassandra

resume repairs automatically after errors #106

Open marcusb opened 9 years ago

marcusb commented 9 years ago

The repair often ends up in the ERROR state if nodes are down or restarted. Sometimes the message is "Exception: null". After this happens, the repair must be resumed manually with spreaper. It would be preferable if it resumed automatically, perhaps after some delay.

rzvoncek commented 9 years ago

It looks like the "Exception: null" happened when one of the JMX calls failed (for mundane reasons).

#107 adds an extra check for this and also automatically resumes a run that is in ERROR.
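For readers wondering where a message like "Exception: null" can come from: in Java, an exception constructed without a message returns `null` from `getMessage()`, and string concatenation renders that as the text "null". The sketch below illustrates that general behavior only. It is not the reaper's actual code, and the class and method names (`ErrorMessages`, `naive`, `describe`) are hypothetical.

```java
// Hypothetical helper, for illustration only; not taken from cassandra-reaper.
final class ErrorMessages {
    private ErrorMessages() {}

    // Naive formatting: yields "Exception: null" when the exception
    // was created without a message.
    static String naive(Throwable t) {
        return "Exception: " + t.getMessage();
    }

    // More informative formatting: always includes the exception class,
    // and appends the message only when one is present.
    static String describe(Throwable t) {
        String message = t.getMessage();
        if (message == null || message.isEmpty()) {
            return "Exception: " + t.getClass().getName();
        }
        return "Exception: " + t.getClass().getName() + ": " + message;
    }
}
```

With a message-less `java.io.IOException`, `naive` produces the uninformative text from the report, while `describe` at least names the exception type.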

Bj0rnen commented 9 years ago

We tweaked our approach to this. We agreed that ERROR should mean nothing other than "unrecoverable error", so we no longer set the repair run to that state unless it hits a known unrecoverable condition (a repair segment mismatch with the cluster topology is the only known one for now). The run now keeps retrying if it is hit by exceptions that we don't handle anywhere.

Hopefully that doesn't become a problem in and of itself. At least it's better than retrying when we already know that it's not going to work.
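The policy described above (ERROR only for a known unrecoverable condition, retry on everything else) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the reaper's implementation: `RepairRetrySketch`, `RunState`, `SegmentTopologyMismatchException`, and `runWithRetries` are all hypothetical names, and the `maxAttempts` bound exists only so the sketch terminates, whereas the comment above describes retrying without a stated limit.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these names come from cassandra-reaper.
final class RepairRetrySketch {
    private RepairRetrySketch() {}

    enum RunState { RUNNING, DONE, ERROR }

    // Assumed marker for the one known unrecoverable case:
    // repair segments that no longer match the cluster topology.
    static final class SegmentTopologyMismatchException extends RuntimeException {
        SegmentTopologyMismatchException(String message) {
            super(message);
        }
    }

    // Runs one repair step, retrying after a delay on any exception that is
    // not known to be unrecoverable. Returns ERROR only for the known case.
    static RunState runWithRetries(Callable<Void> step, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                step.call();
                return RunState.DONE;
            } catch (SegmentTopologyMismatchException e) {
                // Known unrecoverable: retrying cannot help.
                return RunState.ERROR;
            } catch (InterruptedException e) {
                // Preserve the interrupt and leave the run resumable.
                Thread.currentThread().interrupt();
                return RunState.RUNNING;
            } catch (Exception e) {
                // Unhandled, possibly transient (e.g. a node restarting):
                // fall through to the delay and try again.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return RunState.RUNNING;
                }
            }
        }
        // Attempts exhausted in this sketch: the run stays RUNNING rather
        // than ERROR, so a later pass can pick it up again.
        return RunState.RUNNING;
    }
}
```

The design choice worth noting is the last line: running out of attempts does not flip the run to ERROR, which keeps ERROR reserved for cases where a retry is already known to be pointless.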