lqid commented 7 years ago

Problem Description:

After starting repair via the GUI, progress remains at 0/x.
Cassandra nodes calculate their respective token ranges, and then nothing happens.
There were no errors in the Reaper or Cassandra logs. Only a message of acknowledgement that a repair had initiated.
Taking a stack trace of the running JVM, one can see that the thread spawning the repair process was waiting on a lock that was never released.
This occurred on all nodes, and prevented any manually initiated repair process from running. A rolling restart of each node was required, after which one could run a nodetool repair successfully.

Cassandra Cluster Details:

Cassandra 2.2.5 running on Windows Server 2008 R2
6 node cluster, split across 2 DCs, with RF = 3:3.

Reaper Details:

Reaper 0.3.3 running on Windows Server 2008 R2, utilising a PostgreSQL database.

Reaper settings:

Parallelism: DC-Aware
Repair Intensity: 0.9
Incremental: true

Don't want to swamp you with more details or unnecessary logs, especially as I'd have to sanitize them before sending them out, so please let me know if there is anything else I can provide, and I'll do my best to get it to you.

adejanovski commented 7 years ago

Hi @lqid ,

It is possible that Reaper cannot reach some nodes through JMX, especially across DCs. Could you please try to run a repair through the GUI, then activate it and click on the repair to open the details panel? What's written on the last event line?

If there's nothing obvious there, we'll need to go through the logs to find what's wrong. Things you can try to narrow the problem down:

Run a full repair instead of an incremental one
Run Reaper in memory mode instead of with the database, to check whether the storage backend is the problem

zznate commented 7 years ago

> Cassandra 2.2.5 running on Windows Server 2008 R2

@lqid I very much recommend updating to 2.2.8, or even the 2.2.9 tip (we may not formally release 2.2 again), as a number of minor streaming and repair issues were fixed between those versions. The most relevant is sane streaming timeouts by default.

lqid commented 7 years ago

Hi @adejanovski Regarding the JMX connection, I've made sure that the server Reaper resides on is able to remotely connect via JMX to each of the Cassandra nodes via JConsole.
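For anyone who wants to script a similar check rather than click through JConsole, here is a rough sketch. It only tests that the JMX port accepts a TCP connection, which is weaker than a full JConsole session, since JMX over RMI can negotiate a second port. The function name is made up for illustration, and 7199 is assumed because it is Cassandra's default JMX port.

```python
import socket

# Assumption: the nodes use Cassandra's default JMX port. Adjust if overridden.
JMX_PORT = 7199


def jmx_port_reachable(host, port=JMX_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    This is only a reachability probe; it does not speak the JMX protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling `jmx_port_reachable("node1")` from the Reaper server for each node in both DCs would flag any node whose JMX port is blocked outright.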

Last event reads: Triggered repair of segment 4669 via host node1

Repaired progress bar remains at 0/6.

Looking in the Reaper logs, the last message relating to a repair is:

DEBUG  [2017-01-04 08:45:48,251] [ppe_cass1_c1] c.s.r.c.JmxProxy - Received notification: javax.management.Notification[source=repair:1][type=progress][message=Repair completed successfully] 
DEBUG  [2017-01-04 08:45:48,251] [ppe_cass1_c1] c.s.r.c.JmxProxy - Received notification: javax.management.Notification[source=repair:1][type=progress][message=Repair command #1 finished in 3 minutes 37 seconds]

As per Bhuvan Rawal's email to user@cassandra.apache.org, I've also tried adjusting Reaper configuration to repairRunThreadCount: 1, which had no apparent effect.

As for running a full repair instead of an incremental one, I had already tried that, with the same result. I'll run Reaper in memory mode after I give this some time, but I suspect I'll need to do a cluster restart again.

@zznate I'll definitely take that to heart, and I do agree with upgrading to the latest version just on principle. I'll run with that as soon as the opportunity comes up for us to upgrade.

lqid commented 7 years ago

Good news, and bad...

I rescind my previous comment that setting repairRunThreadCount: 1 had no effect. Before this change, repairs would "hang" (as the title of this issue suggests), with threads doing nothing indefinitely. Now I am seeing normal repair progress messages in the logs on both the Reaper server and the Cassandra nodes, albeit coming through very slowly. (To be expected with such a low thread count, I assume?)

Last event is also being updated as below... Last event reads: Triggered repair of segment 4667 via host node3

Again, in Reaper logs, notice time stamp and delta from previous comment:

DEBUG  [2017-01-04 10:17:40,320] [ppe_cass1_c1] c.s.r.c.JmxProxy - Received notification: javax.management.Notification[source=repair:1][type=progress][message=Repair completed successfully] 
DEBUG  [2017-01-04 10:17:40,320] [ppe_cass1_c1] c.s.r.c.JmxProxy - Received notification: javax.management.Notification[source=repair:1][type=progress][message=Repair command #1 finished in 3 minutes 56 seconds]

Note that Repaired progress bar still remains at 0/6. Not sure how the progress bar denominator is calculated, to be honest(?)
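To put a number on the delta between the two "Repair command #1 finished" notifications, a small sketch like the following can be used. It assumes the timestamp is always the first bracketed field in the format shown in the log lines above; the helper name is made up for illustration.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp, e.g. "[2017-01-04 08:45:48,251]".
TIMESTAMP_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def log_timestamp(line):
    """Extract the timestamp from one Reaper log line."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        raise ValueError("no timestamp found in log line")
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


# The two "finished" notifications quoted in this thread.
earlier = ("DEBUG  [2017-01-04 08:45:48,251] [ppe_cass1_c1] c.s.r.c.JmxProxy - "
           "Received notification: javax.management.Notification[source=repair:1]"
           "[type=progress][message=Repair command #1 finished in 3 minutes 37 seconds]")
later = ("DEBUG  [2017-01-04 10:17:40,320] [ppe_cass1_c1] c.s.r.c.JmxProxy - "
         "Received notification: javax.management.Notification[source=repair:1]"
         "[type=progress][message=Repair command #1 finished in 3 minutes 56 seconds]")

delta = log_timestamp(later) - log_timestamp(earlier)
print(delta)  # 1:31:52.069000
```

So roughly an hour and a half passed between the two notifications, even though each repair command itself reports finishing in under four minutes.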

adejanovski commented 7 years ago

@lqid: I'm able to reproduce the problem using a CCM cluster with Cassandra 2.2.5. The acceptance test suite fails because the first segment is never marked as DONE. Running it with Cassandra 2.2.8 works fine though.

I've traced the problem back to CASSANDRA-11430: we're still using the deprecated repair methods in Reaper, which didn't properly handle notifications in Cassandra 2.2 until 2.2.6.

I'd support @zznate's recommendation to upgrade to the latest 2.2 in order to have properly working repairs.

We have an open issue for switching to non-deprecated repair methods, but no ETA yet.
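As a quick way to check whether a given node version falls in the range described above, a sketch along these lines could help. It encodes only what is stated in this thread (2.2.x releases before 2.2.6 are affected) and makes no claim about other release lines; the function names are made up for illustration.

```python
import re


def parse_version(version):
    """Parse the leading 'major.minor.patch' of a version string into a tuple."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def in_affected_2_2_range(version):
    """Return True for Cassandra 2.2.x versions older than 2.2.6.

    Assumption: the range comes from the comment above, not from an
    independent reading of CASSANDRA-11430.
    """
    major, minor, patch = parse_version(version)
    return (major, minor) == (2, 2) and patch < 6
```

For the cluster in this issue, `in_affected_2_2_range("2.2.5")` is true, while the recommended 2.2.8 is outside the range.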

lqid commented 7 years ago

Understood. Thank you all for the support and clear explanations.

thelastpickle / cassandra-reaper

Reaper repair seems to "hang" #39
