We recently ran into a case of a hanging repair run. Some segments kept getting postponed indefinitely because an involved node reported that it was participating in a repair. We got the repair session's hash from that node's log. Other nodes' logs reported the session as finished, but that node's log did not. So apparently the cross-node communication within Cassandra had failed there.
Reaper, on the other hand, was notified that the repair had finished, so it moved on to the remaining segments. But segments within that node's range kept getting blocked by SegmentRunner::canRepair, because the node still reported an ongoing repair.
Potential fix: when SegmentRunner::canRepair discovers a node that's already busy with a repair, compare against Reaper's own storage to determine whether that node really should have a repair ongoing. If not, use JmxProxy::cancelAllRepairs to clear the node's stale state.
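A rough, standalone sketch of what that cross-check could look like. This is illustrative only: the JmxNode and ReaperStorage interfaces and their methods are placeholders, not Reaper's actual API; only the names SegmentRunner::canRepair and JmxProxy::cancelAllRepairs come from the description above.

```java
import java.util.Optional;

final class StuckRepairCheckSketch {

  // Placeholder for the JMX view of a node (stands in for JmxProxy).
  interface JmxNode {
    /** True if the node reports an ongoing repair session. */
    boolean hasOngoingRepair();
    /** Clears all repair sessions on the node (stands in for JmxProxy::cancelAllRepairs). */
    void cancelAllRepairs();
  }

  // Placeholder for Reaper's storage backend.
  interface ReaperStorage {
    /** Segment Reaper believes is currently running on the given node, if any. */
    Optional<String> runningSegmentOnNode(String nodeHost);
  }

  private final ReaperStorage storage;

  StuckRepairCheckSketch(ReaperStorage storage) {
    this.storage = storage;
  }

  /** Decide whether a new segment repair may start on the given node. */
  boolean canRepair(JmxNode node, String nodeHost) {
    if (!node.hasOngoingRepair()) {
      return true; // node is idle, nothing to resolve
    }
    if (storage.runningSegmentOnNode(nodeHost).isPresent()) {
      return false; // Reaper agrees a repair is running: postpone this segment
    }
    // Node claims a repair is running, but Reaper knows of none: treat the
    // node's session state as stale and clear it so the run can proceed.
    node.cancelAllRepairs();
    return true;
  }
}
```

The key design point is that the node's self-reported state is no longer trusted on its own; it is only honored when Reaper's storage confirms that a repair should indeed be in flight on that node.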