thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

All repair schedules turned off, but seeing an id: null error #1206

Closed StevenLacerda closed 2 years ago

StevenLacerda commented 2 years ago

Customer has disabled all repairs based on an issue they're seeing with snapshots. In the meantime, while all repairs are disabled, they're getting the following error:

INFO   [2022-06-16 00:09:45,093] [SchedulingManagerTimer] i.c.s.SchedulingManager - Repair schedule '24aa0620-8826-11ec-ba6f-0338dd77ee5b' is paused 
ERROR  [2022-06-16 00:10:45,091] [SchedulingManagerTimer] i.c.s.SchedulingManager - failed managing schedule for run with id: null 
ERROR  [2022-06-16 00:10:45,091] [SchedulingManagerTimer] i.c.s.SchedulingManager - catch exception 
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:659)
    at java.util.ArrayList.get(ArrayList.java:435)
    at io.cassandrareaper.service.SchedulingManager.currentReaperIsSchedulingLeader(SchedulingManager.java:263)
    at io.cassandrareaper.service.SchedulingManager.run(SchedulingManager.java:108)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
ERROR  [2022-06-16 00:10:45,091] [SchedulingManagerTimer] i.c.s.SchedulingManager - SchedulingManager failed. Exiting JVM. 
INFO   [2022-06-16 00:10:50,715] [JettyShutdownThread] i.c.s.CassandraStorage - Reaper is stopping, removing this instance from running reapers... 

I don't see anything similar in the reported issues. Any ideas on what could be causing this? As you can see, the JVM shuts down because of the error.


adejanovski commented 2 years ago

Ouch, is that happening even after Reaper is restarted? Which version of Reaper are they running?

StevenLacerda commented 2 years ago

Cassandra version 4.0.1 and reaper version 3.1.1.

I'm not sure what you mean by restarting:

1) If they restart the JVM, then yes.
2) If they start repairs again, then no.

adejanovski commented 2 years ago

It's really weird because the error reported here means the running_reapers table is empty, which shouldn't be the case in distributed mode:

  /**
   * When multiple Reapers are running, only the older one can start schedules.
   * In non distributed modes, this method always returns true.
   *
   * @return true or false
   */
  @VisibleForTesting
  boolean currentReaperIsSchedulingLeader() {
    if (context.isDistributed.get()) {
      List<UUID> runningReapers = ((IDistributedStorage) context.storage).getRunningReapers();
      Collections.sort(runningReapers);
      return context.reaperInstanceId.equals(runningReapers.get(0));
    }

    return true;
  }

@StevenLacerda, could you check which datacenterAvailability is configured in the yaml, and what the content of the running_reapers table is? I don't think the problem is related to the schedules all being paused; the exception suggests otherwise.

StevenLacerda commented 2 years ago

Ok, I'll let you know.

StevenLacerda commented 2 years ago

Running reapers table isn't empty:

cqlsh:reaper_db> select * from running_reapers;

 reaper_instance_id                   | last_heartbeat                  | reaper_instance_host
--------------------------------------+---------------------------------+----------------------
 5a8d1590-f09d-11ec-918f-353c7bec787b | 2022-06-23 21:12:23.446000+0000 |           127.0.0.1

I'm still checking on datacenterAvailability.

StevenLacerda commented 2 years ago

He has:

datacenterAvailability: ALL

Do you recommend that we change that?

BrandonBordeaux commented 2 years ago

@adejanovski, I'm following up on Steve's behalf. Do you need any additional information to help diagnose this issue? Is there anything we can do to work around it?

adejanovski commented 2 years ago

Hi @BrandonBordeaux,

I'll push a fix for that issue very shortly. While I don't totally understand what's going on, I can harden the code so that we don't get an error that then exits the JVM.

I was planning to release a new version of Reaper soon, so this fix will make it in.
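For readers following along, here is a minimal sketch of what such hardening could look like. It is not the fix that shipped in Reaper. It is a stand-alone variant of the `currentReaperIsSchedulingLeader()` method quoted above, with the storage lookup replaced by a plain list parameter so it can run on its own. The class name `SchedulingLeaderSketch`, the method name `isSchedulingLeader`, and the choice to return false for an empty list are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical stand-alone sketch; not the actual Reaper code.
public class SchedulingLeaderSketch {

  /**
   * Returns true when this instance should start schedules.
   * Assumption: an empty (or null) list of running Reapers means
   * "leadership unknown", so we return false instead of calling get(0),
   * which is the call that threw IndexOutOfBoundsException in the stack trace.
   */
  static boolean isSchedulingLeader(boolean distributed, UUID self, List<UUID> runningReapers) {
    if (!distributed) {
      // Same behaviour as the quoted method: non-distributed mode is always leader.
      return true;
    }
    if (runningReapers == null || runningReapers.isEmpty()) {
      return false;
    }
    // Sort a copy so the caller's list is left untouched.
    List<UUID> sorted = new ArrayList<>(runningReapers);
    Collections.sort(sorted);
    return self.equals(sorted.get(0));
  }

  public static void main(String[] args) {
    UUID self = UUID.fromString("5a8d1590-f09d-11ec-918f-353c7bec787b");
    // Empty list: no exception, just "not leader".
    System.out.println(isSchedulingLeader(true, self, Collections.emptyList()));
    // Single running Reaper that is this instance: leader.
    System.out.println(isSchedulingLeader(true, self, Collections.singletonList(self)));
  }
}
```

With a guard like this, an empty result would only cause the current scheduling pass to be skipped, and the next timer tick could retry rather than the exception propagating up and exiting the JVM.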