Closed RolandOtta closed 7 years ago
Seeing exactly the same thing (version 3.0.14). More details here #92
Happened again, table repair_run shows one coordinator being null
92165cf0-631d-11e7-a844-dbfd17b7d833 | 9216d229-631d-11e7-a844-dbfd17b7d833 | no cause specified | cassandracluster | 2017-07-07 14:07:11+0000 | 2017-07-07 14:07:14+0000 | 0.9 | Postponed a segment because no coordinator was reachable | Stefano | 2017-07-07 14:07:14+0000 | parallel | 9206a580-631d-11e7-a844-dbfd17b7d833 | 13 | 2017-07-07 14:07:14+0000 | RUNNING | null | -6403394277111986699 | 91 | 2017-07-07 14:07:14+0000 | null | 0 | -6414212751690768970
@adejanovski, I am trying to understand why that might happen in the context of incremental repairs.
During my tests, the coordinator is never null when starting the repair, but only after a number of steps have been completed (and necessarily, after a segment has been postponed at least once).
Based on what I see, SegmentRunner.postpone should never set the coordinator to null when postponing a segment, so I am a bit lost.
Ideas where to look further?
Using incremental repairs, we should indeed never set the coordinator to null, so if that happens there is still a code path that allows it to be nulled.
I'll inspect the code shortly and come up with a proper patch.
Cool! Let me know the branch and I will test it right away. Thx!
@adejanovski , did you manage to give it a look by any chance? Thx a lot!
Hi @ostefano,
sorry for the time it took but I was able to reproduce and fix the issue. I've created PR #146 with the fix.
If a segment cannot get repaired within the timeout, abort() is called without the RepairUnit: https://github.com/thelastpickle/cassandra-reaper/blob/master/src/main/java/com/spotify/reaper/service/SegmentRunner.java#L123
The PR passes the RepairUnit to abort(), which then detects that it's an incremental repair and no longer voids coordinator_host.
Could you test the branch and tell us if it works?
Thanks
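For readers following along, the behaviour described above can be sketched roughly as follows. This is a simplified, hypothetical model and not the actual Reaper code: the class names AbortSketch and Segment, the fields, the NOT_STARTED state and the postpone signature are all assumptions made for illustration. It only shows the idea that the coordinator is kept when the repair unit is known to be incremental, and cleared otherwise.

```java
import java.util.Optional;

// Hypothetical, simplified model of the fix discussed in this thread.
// None of these types are the real Reaper classes.
public class AbortSketch {

    /** Stand-in for a repair unit: only the flag relevant here (assumed name). */
    static final class RepairUnit {
        final boolean incrementalRepair;

        RepairUnit(boolean incrementalRepair) {
            this.incrementalRepair = incrementalRepair;
        }
    }

    /** Stand-in for a segment row; coordinatorHost may be null. */
    static final class Segment {
        final long id;
        final String coordinatorHost;
        final String state;

        Segment(long id, String coordinatorHost, String state) {
            this.id = id;
            this.coordinatorHost = coordinatorHost;
            this.state = state;
        }
    }

    /**
     * Postpones a segment. The coordinator is preserved only when the caller
     * supplies a repair unit and that unit is an incremental repair.
     * When no unit is supplied (the situation described in the thread),
     * the coordinator is cleared.
     */
    static Segment postpone(Segment segment, Optional<RepairUnit> unit) {
        boolean keepCoordinator = unit.isPresent() && unit.get().incrementalRepair;
        String coordinator = keepCoordinator ? segment.coordinatorHost : null;
        // "NOT_STARTED" is an assumed state name for a postponed segment.
        return new Segment(segment.id, coordinator, "NOT_STARTED");
    }

    public static void main(String[] args) {
        Segment running = new Segment(61445L, "10.0.0.1", "RUNNING");

        // Caller forgets to pass the repair unit: coordinator is lost.
        Segment withoutUnit = postpone(running, Optional.empty());

        // Caller passes an incremental repair unit: coordinator is kept.
        Segment withUnit = postpone(running, Optional.of(new RepairUnit(true)));

        System.out.println("without unit: " + withoutUnit.coordinatorHost);
        System.out.println("with unit:    " + withUnit.coordinatorHost);
    }
}
```

In this sketch the first call leaves the segment with a null coordinator, which is the state that later produces "Null host given to JmxProxy.connect()", while the second call keeps the original host.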
Hi @adejanovski, thx a lot!
I have been testing ft-reaper-improvements-final in the meantime. Do you think I can just cherry-pick that commit and run ft-reaper-improvements-final + PR 146?
Thanks
Hi @ostefano,
yes totally, and we'll soon rebase ft-reaper-improvements-final over master anyway.
Also, I've just recently added proper support for incremental repair in ft-reaper-improvements-final when running multiple reaper instances.
Hi @adejanovski , been testing the patch and all seems good. Thx for fixing this!
Hi folks,
we sometimes get the error message "Postponed a segment because no coordinator was reachable" when using incremental repairs in our Cassandra 3.10 production cluster.
the repair does not recover from that point. we have to stop the incremental repair and start a new one .. the new repair then normally works without any issues
when having this error we can see the following in the creaper log
DEBUG [2017-05-15 07:34:05,770] [productioncluster:93:61445] c.s.r.c.JmxConnectionFactory - Unreachable host
com.spotify.reaper.ReaperException: Null host given to JmxProxy.connect()
    at com.spotify.reaper.cassandra.JmxProxy.connect(JmxProxy.java:110) ~[creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.cassandra.JmxConnectionFactory.connect(JmxConnectionFactory.java:50) ~[creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:69) ~[creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:148) [creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:93) [creaper.jar:0.5.1-SNAPSHOT]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
WARN [2017-05-15 07:34:05,770] [productioncluster:93:61445] c.s.r.s.SegmentRunner - Failed to connect to a coordinator node for segment 61445
com.spotify.reaper.ReaperException: no host could be reached through JMX
    at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:75) ~[creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:148) [creaper.jar:0.5.1-SNAPSHOT]
    at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:93) [creaper.jar:0.5.1-SNAPSHOT]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
according to nodetool status all cluster nodes are in state up/normal
br, roland