Closed Ritu-Thakur closed 7 years ago
@rustyrazorblade I still see this issue in 0.6.2. When a single node in a multi-DC cluster is unreachable, all repair sessions are halted even though allowUnreachableNodes is true.
Hi @ritu0407,
this is actually a Cassandra limitation. Cassandra won't repair a token range if some of its replicas are down. There's going to be an option in Cassandra 4.0 to allow repair with down nodes: https://issues.apache.org/jira/browse/CASSANDRA-10446
We could theoretically reproduce this behavior in Reaper by adding an option to limit the hosts used for repair to those that are up at the time the job starts.
It would be the equivalent of nodetool repair -hosts.
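To make the idea concrete, here is a minimal sketch of what "limit the hosts to those that are up" could look like. It is not Reaper code: the function names, the `is_up` status map, and the two-replica minimum are assumptions made for illustration. It only builds a `nodetool repair` argument list with one `-hosts` flag per live replica. Replicas left out of the list are not repaired by that run.

```python
from typing import Dict, List


def live_replicas(replicas: List[str], is_up: Dict[str, bool]) -> List[str]:
    """Return the replicas marked as up, preserving their order.

    Hosts missing from the (hypothetical) status map are treated as down.
    """
    return [host for host in replicas if is_up.get(host, False)]


def repair_command(keyspace: str, replicas: List[str],
                   is_up: Dict[str, bool]) -> List[str]:
    """Build a nodetool repair argument list restricted to live replicas."""
    hosts = live_replicas(replicas, is_up)
    # Assumption for this sketch: a repair needs at least two replicas
    # to compare, so refuse to build a command otherwise.
    if len(hosts) < 2:
        raise ValueError("need at least two live replicas to repair")
    cmd = ["nodetool", "repair"]
    for host in hosts:
        cmd += ["-hosts", host]  # one -hosts flag per live replica
    cmd.append(keyspace)
    return cmd
```

For example, with replicas 10.0.0.1, 10.0.0.2 and 10.0.0.3 where only 10.0.0.2 is down, `repair_command("mailbox", ...)` yields a command that names 10.0.0.1 and 10.0.0.3 only.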
@michaelsembwever @rustyrazorblade: what do you think of this?
> We could theoretically reproduce this behavior in Reaper by adding an option to limit the hosts used for repair to those that are up at the time the job starts.
-1. It only introduces uncertainty about whether the data is really repaired, even after a repair has run successfully.
-1 as well
ok folks, closing the issue then.
In our testing of Reaper on 2DC-6C* nodes, we found that when one node is down the tool starts throwing JMX errors and does not repair anything.
We found that the "allowUnreachableNodes" property controls this. Here is the description of the property from the README file: " The allowUnreachableNodes parameter in cassandra-reaper.yaml must then be set to true in order for Reaper to control the repair process through the reachable nodes only. Limitations of this setup are:
In our case allowUnreachableNodes was "false", so we changed it to "true". Now it tries to repair but fails later (logs below). Is there any other property that needs to be set so that Reaper works even if a few nodes are down?
INFO [2017-08-10 17:02:04,921] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.s.SegmentRunner - It is ok to repair segment 'b26be2b0-7e14-11e7-8baf-5d222de23ff3' on repair run with id 'b268d570-7e14-11e7-8baf-5d222de23ff3'
INFO [2017-08-10 17:02:04,922] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.c.JmxProxy - Triggering repair of range (-2397861424239944058,-2394213794648335902] for keyspace "mailbox" on host x.xx.xx.xxx, with repair parallelism parallel, in cluster with Cassandra version '3.11.0' (can use DATACENTER_AWARE 'true'), for column families: []
INFO [2017-08-10 17:02:04,947] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.s.SegmentRunner - Repair for segment b26be2b0-7e14-11e7-8baf-5d222de23ff3 started, status wait will timeout in 18000000 millis
WARN [2017-08-10 17:02:05,987] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.s.SegmentRunner - repair session failed for segment with id 'b26be2b0-7e14-11e7-8baf-5d222de23ff3' and repair number '20'
INFO [2017-08-10 17:02:05,990] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.s.SegmentRunner - Postponing segment b26be2b0-7e14-11e7-8baf-5d222de23ff3
INFO [2017-08-10 17:02:05,996] [mailboxcluster:b268d570-7e14-11e7-8baf-5d222de23ff3:b26be2b0-7e14-11e7-8baf-5d222de23ff3] c.s.r.s.SegmentRunner - Repair command 20 on segment b26be2b0-7e14-11e7-8baf-5d222de23ff3 returned with state NOT_STARTED