Reaper doesn't work with one node down in 2DC-6 Cassandra nodes even though allowUnreachableNodes is true

Ritu-Thakur commented 7 years ago

In our testing of Reaper on 2DC-6C* nodes, we found that when one node is down, the tool starts throwing JMX errors and does not repair anything.

I found that the "allowUnreachableNodes" property controls this. Here is the description of the property from the README file:

"The allowUnreachableNodes parameter in cassandra-reaper.yaml must then be set to true in order for Reaper to control the repair process through the reachable nodes only. Limitations of this setup are:

All keyspaces must be replicated on the reachable DC using NetworkTopologyStrategy
Reaper won't be able to check the unreachable DC nodes for pending compactions or running repairs, which disables repair overload prevention

Leaving allowUnreachableNodes to false will prevent all repair sessions once a single node from the cluster is unreachable."

In our case allowUnreachableNodes was "false", so we changed it to "true". Now Reaper tries to repair but fails later (logs below). Is there any other property that needs to be set to make Reaper work even if a few nodes are down?

INFO [2017-08-10 17:02:04,921] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.s.SegmentRunner - It is ok to repair segment 'b26be2b0-7e14-11e7-8baf-5d222de23ff3' on repair run with id 'b268d570-7e14-11e7-8baf-5d222de23ff3'
INFO [2017-08-10 17:02:04,922] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.c.JmxProxy - Triggering repair of range (-2397861424239944058,-2394213794648335902] for keyspace "mailbox" on host x.xx.xx.xxx, with repair parallelism parallel, in cluster with Cassandra version '3.11.0' (can use DATACENTER_AWARE 'true'), for column families: []
INFO [2017-08-10 17:02:04,947] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.s.SegmentRunner - Repair for segment b26be2b0-7e14-11e7-8baf-5d222de23ff3 started, status wait will timeout in 18000000 millis
WARN [2017-08-10 17:02:05,987] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.s.SegmentRunner - repair session failed for segment with id 'b26be2b0-7e14-11e7-8baf-5d222de23ff3' and repair number '20'
INFO [2017-08-10 17:02:05,990] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.s.SegmentRunner - Postponing segment b26be2b0-7e14-11e7-8baf-5d222de23ff3
INFO [2017-08-10 17:02:05,996] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.s.SegmentRunner - Repair command 20 on segment b26be2b0-7e14-11e7-8baf-5d222de23ff3 returned with state NOT_STARTED

Ritu-Thakur commented 7 years ago

@rustyrazorblade I still see this issue in 0.6.2. When a single node in a multi-DC cluster is unreachable, all repair sessions are halted even though allowUnreachableNodes is true.

adejanovski commented 7 years ago

Hi @ritu0407,

This is actually a Cassandra limitation. Cassandra won't repair a token range if some of its replicas are down. There's going to be an option in Cassandra 4.0 to allow repair with down nodes: https://issues.apache.org/jira/browse/CASSANDRA-10446

We could theoretically reproduce this behavior in Reaper by adding an option to limit the hosts used for repair to those that are up at the time the job starts. It would be the equivalent of nodetool repair -hosts.
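To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is not Reaper or Cassandra code: `TokenRange`, `plan_repair`, the host names, and the rule that at least two live replicas are needed are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRange:
    start: int
    end: int
    replicas: frozenset  # hypothetical host names owning this range


def plan_repair(ranges, down_hosts, restrict_to_up_hosts=False):
    """Split ranges into repairable and blocked ones.

    Default mode mirrors the limitation described above: a range is blocked
    as soon as one of its replicas is down. With restrict_to_up_hosts=True
    the sketch mimics the proposed option (only hosts that are up take part)
    and records which replicas would be left out of the repair.
    """
    down = frozenset(down_hosts)
    repairable, blocked, skipped = [], [], {}
    for r in ranges:
        missing = r.replicas & down
        if not missing:
            repairable.append(r)
        elif restrict_to_up_hosts and len(r.replicas - down) >= 2:
            # Assumption: two live replicas are enough to compare data.
            repairable.append(r)
            skipped[r] = missing
        else:
            blocked.append(r)
    return repairable, blocked, skipped


# Hypothetical 2-DC, 6-node layout with one node down.
ranges = [
    TokenRange(0, 100, frozenset({"dc1-a", "dc1-b", "dc2-a", "dc2-b"})),
    TokenRange(100, 200, frozenset({"dc1-b", "dc1-c", "dc2-b", "dc2-c"})),
    TokenRange(200, 300, frozenset({"dc1-c", "dc1-a", "dc2-c", "dc2-a"})),
]
down = {"dc2-a"}

strict = plan_repair(ranges, down)
relaxed = plan_repair(ranges, down, restrict_to_up_hosts=True)

print("strict: repairable", len(strict[0]), "blocked", len(strict[1]))
print("relaxed: repairable", len(relaxed[0]), "blocked", len(relaxed[1]))
for r, missing in relaxed[2].items():
    print(f"range ({r.start},{r.end}] would skip replicas {sorted(missing)}")
```

In the restricted mode every range in this toy layout is reported as repairable, but `skipped` lists the replicas that never took part in the repair.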

@michaelsembwever @rustyrazorblade: what do you think of this?

michaelsembwever commented 7 years ago

"We could theoretically reproduce this behavior in Reaper by adding an option to limit the hosts used for repair to those that are up at the time the job starts."

-1. It only introduces uncertainty about whether the data is really repaired, even after a repair has run successfully.

rustyrazorblade commented 7 years ago

-1 as well

adejanovski commented 7 years ago

ok folks, closing the issue then.

thelastpickle / cassandra-reaper

Reaper doesn't work with one node down in 2DC-6 Cassandra nodes even though allowUnreachableNodes is true #156