Postponed a segment because no coordinator was reachable

RolandOtta commented 7 years ago

Hi folks,

We sometimes get the error message "Postponed a segment because no coordinator was reachable" when using incremental repairs in our Cassandra 3.10 production cluster.

The repair does not recover from that point. We have to stop the incremental repair and start a new one; the new repair then normally works without any issues.

When this error occurs, we can see the following in the creaper log:

DEBUG [2017-05-15 07:34:05,770] [productioncluster:93:61445] c.s.r.c.JmxConnectionFactory - Unreachable host
com.spotify.reaper.ReaperException: Null host given to JmxProxy.connect()
    at com.spotify.reaper.cassandra.JmxProxy.connect(JmxProxy.java:110) ~[creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.cassandra.JmxConnectionFactory.connect(JmxConnectionFactory.java:50) ~[creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:69) ~[creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:148) [creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:93) [creaper.jar:0.5.1-SNAPSHOT]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
WARN [2017-05-15 07:34:05,770] [productioncluster:93:61445] c.s.r.s.SegmentRunner - Failed to connect to a coordinator node for segment 61445
com.spotify.reaper.ReaperException: no host could be reached through JMX
    at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:75) ~[creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:148) [creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:93) [creaper.jar:0.5.1-SNAPSHOT]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]

According to nodetool status, all cluster nodes are in state up/normal.

br, roland

ostefano commented 7 years ago

Seeing exactly the same thing (version 3.0.14). More details in #92.

ostefano commented 7 years ago

It happened again. The repair_run table shows one coordinator being null:

92165cf0-631d-11e7-a844-dbfd17b7d833 | 9216d229-631d-11e7-a844-dbfd17b7d833 | no cause specified | cassandracluster | 2017-07-07 14:07:11+0000 | 2017-07-07 14:07:14+0000 |       0.9 | Postponed a segment because no coordinator was reachable | Stefano | 2017-07-07 14:07:14+0000 |           parallel | 9206a580-631d-11e7-a844-dbfd17b7d833 |            13 | 2017-07-07 14:07:14+0000 | RUNNING |             null | -6403394277111986699 |         91 | 2017-07-07 14:07:14+0000 |                     null |             0 | -6414212751690768970

ostefano commented 7 years ago

@adejanovski, I am trying to understand why that might happen in the context of incremental repairs.

During my tests, the coordinator is never null when starting the repair, but only after a number of steps have been completed (and necessarily, after a segment has been postponed at least once).

Based on what I see, SegmentRunner.postpone should never set the coordinator to null when postponing a segment, so I am a bit lost.

Ideas where to look further?

adejanovski commented 7 years ago

With incremental repairs, we should indeed never set the coordinator to null, so if that happens, there is still a code path that allows it to be nulled.

I'll inspect the code shortly and come up with a proper patch.

ostefano commented 7 years ago

Cool! Let me know the branch and I will test it right away. Thx!

ostefano commented 7 years ago

@adejanovski , did you manage to give it a look by any chance? Thx a lot!

adejanovski commented 7 years ago

Hi @ostefano,

Sorry for the time it took, but I was able to reproduce and fix the issue. I've created PR #146 with the fix.

If a segment cannot be repaired within the timeout, abort() is called but is not given the RepairUnit: https://github.com/thelastpickle/cassandra-reaper/blob/master/src/main/java/com/spotify/reaper/service/SegmentRunner.java#L123

The PR passes the RepairUnit to abort(), which then detects that it is an incremental repair and no longer voids coordinator_host.
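To make the before/after behaviour concrete, here is a minimal sketch of the idea described above. It is not the code from PR #146 or from Reaper itself: the class `CoordinatorAbortSketch`, its nested `RepairUnit` and `Segment` types, and the methods `abortWithoutUnit` and `abort` are hypothetical names invented for illustration, and the real classes carry far more state.

```java
/**
 * Hypothetical, simplified model of the bug and fix discussed in this thread.
 * None of these types are Reaper's actual classes.
 */
final class CoordinatorAbortSketch {

    /** Stand-in for a repair unit; only the flag relevant here is modelled. */
    static final class RepairUnit {
        final boolean incrementalRepair;

        RepairUnit(boolean incrementalRepair) {
            this.incrementalRepair = incrementalRepair;
        }
    }

    /** Stand-in for a repair segment and its assigned coordinator host. */
    static final class Segment {
        final long id;
        final String coordinatorHost; // null means "no coordinator assigned"

        Segment(long id, String coordinatorHost) {
            this.id = id;
            this.coordinatorHost = coordinatorHost;
        }
    }

    /**
     * Buggy shape: abort has no repair unit to inspect, so it cannot tell
     * that the repair is incremental and always clears the coordinator.
     */
    static Segment abortWithoutUnit(Segment segment) {
        return new Segment(segment.id, null);
    }

    /**
     * Fixed shape: abort receives the repair unit and keeps the coordinator
     * when the repair is incremental.
     */
    static Segment abort(Segment segment, RepairUnit unit) {
        String coordinator = unit.incrementalRepair ? segment.coordinatorHost : null;
        return new Segment(segment.id, coordinator);
    }

    public static void main(String[] args) {
        // The host address is made up for the example.
        Segment timedOut = new Segment(61445L, "10.0.0.1");

        Segment buggy = abortWithoutUnit(timedOut);
        Segment fixed = abort(timedOut, new RepairUnit(true));

        System.out.println("without RepairUnit: coordinator=" + buggy.coordinatorHost);
        System.out.println("with RepairUnit:    coordinator=" + fixed.coordinatorHost);
    }
}
```

In this model, a segment that went through `abortWithoutUnit` is later retried with a null coordinator, which matches the "Null host given to JmxProxy.connect()" symptom reported at the top of the thread.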

Could you test the branch and tell us if it works?

Thanks

ostefano commented 7 years ago

Hi @adejanovski, thx a lot!

I have been testing ft-reaper-improvements-final in the meantime. Do you think I can just cherry-pick that commit and run ft-reaper-improvements-final + PR 146?

Thanks

adejanovski commented 7 years ago

Hi @ostefano,

yes totally, and we'll soon rebase ft-reaper-improvements-final over master anyway. Also, I've just recently added proper support for incremental repair in ft-reaper-improvements-final when running multiple reaper instances.

ostefano commented 7 years ago

Hi @adejanovski , been testing the patch and all seems good. Thx for fixing this!

thelastpickle / cassandra-reaper

Postponed a segment because no coordinator was reachable #103