StevenLacerda closed 2 years ago
Ouch, is that happening even after Reaper is restarted? Which version of Reaper are they running?
Cassandra version 4.0.1 and reaper version 3.1.1.
I'm not sure what you mean by restarting:
1) If they restart the JVM, then yes. 2) If they start repairs again, then no.
It's really weird because the error reported here means the running_reapers table is empty, which shouldn't be the case in distributed mode:
/**
 * When multiple Reapers are running, only the older one can start schedules.
 * In non distributed modes, this method always returns true.
 *
 * @return true or false
 */
@VisibleForTesting
boolean currentReaperIsSchedulingLeader() {
  if (context.isDistributed.get()) {
    List<UUID> runningReapers = ((IDistributedStorage) context.storage).getRunningReapers();
    Collections.sort(runningReapers);
    return context.reaperInstanceId.equals(runningReapers.get(0));
  }
  return true;
}
@StevenLacerda, could you check which datacenterAvailability is configured in the yaml, and what's the content of the running_reapers table?
I don't think the problem is caused by all the schedules being disabled; the exception suggests otherwise.
Ok, I'll let you know.
Running reapers table isn't empty:
cqlsh:reaper_db> select * from running_reapers;
reaper_instance_id | last_heartbeat | reaper_instance_host
--------------------------------------+---------------------------------+----------------------
5a8d1590-f09d-11ec-918f-353c7bec787b | 2022-06-23 21:12:23.446000+0000 | 127.0.0.1
I'm still checking on datacenterAvailability.
He has:
datacenterAvailability: ALL
Do you recommend that we change that?
@adejanovski, I'm following up on Steve's behalf. Do you need any additional information to help diagnose this issue? Is there anything we can do to work around it?
Hi @BrandonBordeaux,
I'll push a fix for that issue very shortly. While I don't totally understand what's going on, I can harden the code so that we don't get an error that then exits the JVM.
I was planning to release a new version of Reaper soon, so this fix will make it in.
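For anyone following along, here is a rough sketch of the kind of hardening this could mean. It is not the actual patch: the class name SchedulingLeaderSketch and the static method isSchedulingLeader are made up for illustration, and returning false when no running Reaper is listed is an assumption about the desired behavior. The point is only that an empty list no longer reaches get(0), so no exception is thrown.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical, standalone illustration; not code from the Reaper repository.
final class SchedulingLeaderSketch {

  private SchedulingLeaderSketch() {}

  /**
   * Returns true if {@code self} is the smallest id among the running Reapers.
   * An empty or null list yields false instead of throwing (assumed behavior).
   */
  static boolean isSchedulingLeader(UUID self, List<UUID> runningReapers) {
    if (runningReapers == null || runningReapers.isEmpty()) {
      // Nothing registered yet: skip scheduling for this round rather than fail.
      return false;
    }
    // Collections.min uses UUID's natural ordering, which is the same ordering
    // the original code relies on via Collections.sort, and it does not mutate
    // the caller's list.
    return self.equals(Collections.min(runningReapers));
  }

  public static void main(String[] args) {
    UUID first = UUID.fromString("00000000-0000-0001-0000-000000000000");
    UUID second = UUID.fromString("00000000-0000-0002-0000-000000000000");

    System.out.println(isSchedulingLeader(first, Collections.emptyList()));
    System.out.println(isSchedulingLeader(first, Arrays.asList(second, first)));
    System.out.println(isSchedulingLeader(second, Arrays.asList(second, first)));
  }
}
```

Whether "not the leader" is the right answer for an empty table is a design choice; the alternative would be to treat the current instance as leader, which risks two Reapers starting schedules at once.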
Customer has disabled all repairs based on an issue they're seeing with snapshots. In the meantime, while all repairs are disabled, they're getting the following error:
I don't see anything similar in the reported issues, any ideas on what could be causing this? As you can see, the JVM is shutting down due to the error.