thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Improve capacity to handle removed node IPs #1200

Open StevenLacerda opened 2 years ago

StevenLacerda commented 2 years ago

We're getting a filesystem (FS) error that takes a node down. Here's what's happening:

1) Reaper starts, then has internode comms issues:

INFO  [Messaging-EventLoop-3-9] 2022-06-05 21:32:31,203 NoSpamLogger.java:92 - /10.95.48.210:7000->/10.241.52.122:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.ConnectTimeoutException: connection timed out: /10.241.52.122:7000
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)

That node was removed from the cluster about 3 weeks ago, but they didn't remove it from the seeds list.

2) Then I start seeing:
ERROR [RepairJobTask:8] 2022-06-05 21:32:32,534 RepairRunnable.java:178 - Repair e4f371b0-e540-11ec-9169-51ddb6348476 failed:
java.lang.RuntimeException: Repair session e4fb60f0-e540-11ec-9169-51ddb6348476 for range [(-5504181921528839849,-5490870249379662317]] failed with error Could not create snapshot at /10.95.48.210:7000
3) The repair session fails:
Caused by: java.lang.RuntimeException: Parent repair session with id = e4f371b0-e540-11ec-9169-51ddb6348476 has failed.
    at org.apache.cassandra.service.ActiveRepairService.getParentRepairSession(ActiveRepairService.java:683)
    at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:111)
    ... 10 common frames omitted
4) Then we get an FS error, and the node begins shutting down its CQL client and internode transports:
Caused by: java.nio.file.DirectoryNotEmptyException: /disk5/c_data/srm/conf_tiers-4b0de4f0709211eb989cd3f7ef2f9a70/snapshots/e4f371b0-e540-11ec-9169-51ddb6348476
    at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:242)
    at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
    at java.nio.file.Files.delete(Files.java:1126)
    at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:250)
    ... 51 common frames omitted
INFO  [Messaging-EventLoop-3-54] 2022-06-05 21:32:32,553 OutboundConnection.java:1150 - /10.95.48.210:7000(/10.95.48.210:54356)->/10.95.48.210:7000-URGENT_MESSAGES-24d6d867 successfully connected, version = 12, framing = CRC, encryption = encrypted(factory=openssl;protocol=TLSv1.2;cipher=TLS_RSA_WITH_AES_128_GCM_SHA256)
INFO  [Messaging-EventLoop-3-67] 2022-06-05 21:32:32,553 InboundConnectionInitiator.java:464 - /10.95.48.210:7000(/10.95.48.210:54356)->/10.95.48.210:7000-URGENT_MESSAGES-c806ff25 messaging connection established, version = 12, framing = CRC, encryption = encrypted(factory=openssl;protocol=TLSv1.2;cipher=TLS_RSA_WITH_AES_128_GCM_SHA256)
WARN  [RepairJobTask:3] 2022-06-05 21:32:32,554 RepairJob.java:169 - [repair #e4fb60f0-e540-11ec-9169-51ddb6348476] srm.conf_applications sync failed
WARN  [RepairJobTask:6] 2022-06-05 21:32:32,554 RepairJob.java:169 - [repair #e4fb60f0-e540-11ec-9169-51ddb6348476] srm.conf_departments sync failed
ERROR [RepairJobTask:8] 2022-06-05 21:32:32,567 DefaultFSErrorHandler.java:64 - Stopping transports as disk_failure_policy is stop
ERROR [RepairJobTask:8] 2022-06-05 21:32:32,567 StorageService.java:453 - Stopping native transport
INFO  [RepairJobTask:8] 2022-06-05 21:32:32,616 Server.java:171 - Stop listening for CQL clients
ERROR [RepairJobTask:8] 2022-06-05 21:32:32,616 StorageService.java:458 - Stopping gossiper

To me, it seems like we create the snapshot directory, the snapshot fails, and then we try to recreate the snapshot in the same directory, which causes the FS error because the directory isn't empty. Does that sound plausible?
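
For reference, the trace in step 4 goes through `java.nio.file.Files.delete`, which refuses to remove a directory that still has entries. Here is a minimal sketch of that JDK behavior on its own. The class name, temp directory, and `leftover.db` file are made up for illustration and none of this is Cassandra code. It only shows what the exception means; it does not show why files were still in the snapshot directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; not part of Cassandra or Reaper.
public class SnapshotDeleteDemo {

    // Attempts a plain (non-recursive) delete and reports the outcome.
    static String deleteOutcome(Path dir) throws IOException {
        try {
            Files.delete(dir);
            return "deleted";
        } catch (DirectoryNotEmptyException e) {
            // Same exception type as in the stack trace from the issue.
            return "DirectoryNotEmptyException";
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a snapshot directory that still holds a file.
        Path snapshotDir = Files.createTempDirectory("snapshot-demo");
        Path leftover = Files.createFile(snapshotDir.resolve("leftover.db"));

        // The directory has one entry, so the delete is refused.
        System.out.println(deleteOutcome(snapshotDir));

        // Once the entry is gone, the same call succeeds.
        Files.delete(leftover);
        System.out.println(deleteOutcome(snapshotDir));
    }
}
```

So the exception itself comes from a delete of the snapshot directory while something was still inside it, rather than from the create call directly.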

adejanovski commented 1 year ago

Hi, I know this is a fairly old ticket, and I apologize for taking so long to respond. I don't think this can be Reaper related. Reaper does not specify which nodes should be involved in a repair; it just starts a repair session for a list of token ranges through one of the live nodes, which then acts as the coordinator. That coordinator is responsible for contacting the nodes that should be involved in the repair. I think the internode comms issue is a red herring here. The crux of the issue is the inability to create a snapshot, and I'd say that's a Cassandra problem. Cassandra creates snapshots automatically when a sequential or DC-aware repair is used. Again, Reaper has no control over the name and location of the snapshot.