Improve capacity to handle removed node IPs

We're getting a filesystem (FS) error that takes a node down. Here's what's happening:

1) Reaper starts, then we see internode communication failures:

INFO  [Messaging-EventLoop-3-9] 2022-06-05 21:32:31,203 NoSpamLogger.java:92 - /10.95.48.210:7000->/10.241.52.122:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.ConnectTimeoutException: connection timed out: /10.241.52.122:7000
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)

That node (10.241.52.122) was removed from the cluster about 3 weeks ago, but it was never removed from the seeds list.
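
To illustrate the kind of check that would catch this, here is a minimal sketch that compares a configured seed list against the current cluster membership. It assumes both are available as plain IP strings; `StaleSeedCheck` and `staleSeeds` are hypothetical names for illustration, not part of Reaper or Cassandra.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class StaleSeedCheck {

    /**
     * Returns the configured seeds that are no longer cluster members,
     * preserving the order in which they appear in the seed list.
     * Both inputs are assumed to be plain IP strings (hypothetical inputs).
     */
    public static Set<String> staleSeeds(List<String> configuredSeeds, Set<String> liveMembers) {
        Set<String> stale = new LinkedHashSet<>(configuredSeeds);
        stale.removeAll(liveMembers);
        return stale;
    }

    public static void main(String[] args) {
        // IPs taken from the logs in this report: one live node, one removed node.
        List<String> seeds = List.of("10.95.48.210", "10.241.52.122");
        Set<String> members = Set.of("10.95.48.210");
        System.out.println("Stale seeds: " + staleSeeds(seeds, members));
    }
}
```

Any address this returns is a seed that connection attempts will keep timing out against, as in the `ConnectTimeoutException` above.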

Then I start seeing:

ERROR [RepairJobTask:8] 2022-06-05 21:32:32,534 RepairRunnable.java:178 - Repair e4f371b0-e540-11ec-9169-51ddb6348476 failed:
java.lang.RuntimeException: Repair session e4fb60f0-e540-11ec-9169-51ddb6348476 for range [(-5504181921528839849,-5490870249379662317]] failed with error Could not create snapshot at /10.95.48.210:7000

The repair session fails:

Caused by: java.lang.RuntimeException: Parent repair session with id = e4f371b0-e540-11ec-9169-51ddb6348476 has failed.
    at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:683)
    at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:111)
    ... 10 common frames omitted

Then we get an FS error, and the node begins shutting down its CQL client transport and internode gossip:

Caused by: java.nio.file.DirectoryNotEmptyException: /disk5/c_data/srm/conf_tiers-4b0de4f0709211eb989cd3f7ef2f9a70/snapshots/e4f371b0-e540-11ec-9169-51ddb6348476
    at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:242)
    at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
    at java.nio.file.Files.delete(Files.java:1126)
    at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:250)
    ... 51 common frames omitted
INFO  [Messaging-EventLoop-3-54] 2022-06-05 21:32:32,553 OutboundConnection.java:1150 - /10.95.48.210:7000(/10.95.48.210:54356)->/10.95.48.210:7000-URGENT_MESSAGES-24d6d867 successfully connected, version = 12, framing = CRC, encryption = encrypted(factory=openssl;protocol=TLSv1.2;cipher=TLS_RSA_WITH_AES_128_GCM_SHA256)
INFO  [Messaging-EventLoop-3-67] 2022-06-05 21:32:32,553 InboundConnectionInitiator.java:464 - /10.95.48.210:7000(/10.95.48.210:54356)->/10.95.48.210:7000-URGENT_MESSAGES-c806ff25 messaging connection established, version = 12, framing = CRC, encryption = encrypted(factory=openssl;protocol=TLSv1.2;cipher=TLS_RSA_WITH_AES_128_GCM_SHA256)
WARN  [RepairJobTask:3] 2022-06-05 21:32:32,554 RepairJob.java:169 - [repair #e4fb60f0-e540-11ec-9169-51ddb6348476] srm.conf_applications sync failed
WARN  [RepairJobTask:6] 2022-06-05 21:32:32,554 RepairJob.java:169 - [repair #e4fb60f0-e540-11ec-9169-51ddb6348476] srm.conf_departments sync failed
ERROR [RepairJobTask:8] 2022-06-05 21:32:32,567 DefaultFSErrorHandler.java:64 - Stopping transports as disk_failure_policy is stop
ERROR [RepairJobTask:8] 2022-06-05 21:32:32,567 StorageService.java:453 - Stopping native transport
INFO  [RepairJobTask:8] 2022-06-05 21:32:32,616 Server.java:171 - Stop listening for CQL clients
ERROR [RepairJobTask:8] 2022-06-05 21:32:32,616 StorageService.java:458 - Stopping gossiper

To me, it looks like we create the snapshot directory, the snapshot fails, and then we try to recreate the snapshot in the same directory, which triggers the FS error because the directory is not empty. Does that sound plausible?
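
One detail from the trace: the `DirectoryNotEmptyException` is raised from `java.nio.file.Files.delete` (called via `FileUtils.deleteWithConfirm`), so the failing step appears to be the removal of a snapshot directory that still contains files. The standalone sketch below is not Cassandra code, and the class, method, and file names are made up for illustration. It only shows that `Files.delete` refuses a non-empty directory on a typical Unix filesystem:

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SnapshotDirDeleteDemo {

    /**
     * Attempts to delete the directory with Files.delete.
     * Returns true if the delete was rejected because the directory
     * still has entries, false if the delete succeeded.
     */
    public static boolean deleteRejectsNonEmptyDir(Path dir) throws IOException {
        try {
            Files.delete(dir);
            return false;
        } catch (DirectoryNotEmptyException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a snapshot directory that still holds a leftover file.
        Path snapshotDir = Files.createTempDirectory("snapshot-demo");
        Path leftover = Files.createFile(snapshotDir.resolve("leftover-file.db"));

        System.out.println("rejected while non-empty: " + deleteRejectsNonEmptyDir(snapshotDir));

        // Once the leftover file is gone, the same delete goes through.
        Files.delete(leftover);
        System.out.println("rejected once empty: " + deleteRejectsNonEmptyDir(snapshotDir));
    }
}
```

If that reading is right, anything that leaves files behind in the snapshot directory (for example a snapshot that was only partially created or cleared) would be enough to trip the `disk_failure_policy` handling shown in the log.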


thelastpickle / cassandra-reaper

Improve capacity to handle removed node IPs #1200