thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

repairs failing with io.cassandrareaper.ReaperException: java.lang.IllegalStateException: endTime can only be set if segment is DONE #965

Closed guminy closed 2 years ago

guminy commented 4 years ago

We have a two-datacenter configuration, each datacenter with 3 Cassandra servers. We moved the Cassandra nodes from one datacenter to a new datacenter. These nodes now have new IP addresses, but still have the same hostnames.

After this, we get errors in the reaper log file:

ERROR [2020-09-28 00:00:08,743] [pool-1-thread-1] i.c.ReaperApplication - Couldn't resume running repair runs
io.cassandrareaper.ReaperException: java.lang.IllegalStateException: endTime can only be set if segment is DONE
    at io.cassandrareaper.service.RepairManager.resumeRunningRepairRuns(RepairManager.java:127)
    at io.cassandrareaper.ReaperApplication.lambda$run$0(ReaperApplication.java:213)
    at io.cassandrareaper.ReaperApplication$$Lambda$78.000000002C3CF610.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1160)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.lang.Thread.run(Thread.java:818)

How do I resolve this problem?

┆Issue is synchronized with this Jira Task by Unito ┆Issue Number: K8SSAND-460

adejanovski commented 4 years ago

Hi @guminy,

which version of Reaper are you using? We fixed that issue a little while ago.

To unblock the current situation, look in the repair_run table for any segment that has a segment_state not equal to 2 together with a non-null segment_end_time. For each such segment, you'll need to set segment_end_time and segment_start_time to null, and segment_state to 0.
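Since CQL cannot filter on "not equal", one way to apply the steps above is to read the rows and filter them on the client. The following is only a minimal sketch in Python, not an official Reaper tool. It assumes the rows have already been fetched as dictionaries, that the primary key columns of repair_run are named id and segment_id, and that the keyspace is reaper_db; the function names are hypothetical. Check all of these against your own schema before running any generated statement.

```python
from typing import Any, Iterable, Iterator, Mapping

FINISHED_STATE = 2  # state value the comment above treats as "done"
RESET_STATE = 0     # state value the comment above says to reset to


def find_inconsistent_segments(
    rows: Iterable[Mapping[str, Any]],
) -> Iterator[Mapping[str, Any]]:
    """Yield segments that have an end time although they are not finished."""
    for row in rows:
        if row["segment_state"] != FINISHED_STATE and row["segment_end_time"] is not None:
            yield row


def build_reset_statement(row: Mapping[str, Any], keyspace: str = "reaper_db") -> str:
    """Build a CQL UPDATE that clears both timestamps and resets the state.

    Assumes (id, segment_id) is the primary key of repair_run.
    """
    return (
        f"UPDATE {keyspace}.repair_run "
        f"SET segment_start_time = null, segment_end_time = null, "
        f"segment_state = {RESET_STATE} "
        f"WHERE id = {row['id']} AND segment_id = {row['segment_id']};"
    )


if __name__ == "__main__":
    # Hypothetical sample rows; real rows would come from SELECT on repair_run.
    sample = [
        {"id": "run-1", "segment_id": "seg-1", "segment_state": 2, "segment_end_time": "t1"},
        {"id": "run-1", "segment_id": "seg-2", "segment_state": 1, "segment_end_time": "t2"},
    ]
    for segment in find_inconsistent_segments(sample):
        print(build_reset_statement(segment))
```

Printing the statements instead of executing them leaves room to review each UPDATE before applying it with cqlsh.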

Let me know how this works.

guminy commented 4 years ago

We're on version 1.1.0. We'll try the suggested fix and see what happens. Thanks.

adejanovski commented 4 years ago

@guminy, heads up you'll need to upgrade to 1.4.x before you can upgrade to the latest 2.0.5.

guminy commented 4 years ago

@adejanovski Yes, I'm aware of the upgrade limitations :(.

We tried the suggestion and it corrected the issue. However we're not confident our client can easily do this due to limitations in CQL queries. What is the impact of truncating this table instead?

adejanovski commented 4 years ago

Truncating the repair_run table will do it. It'll just remove all the past and current repairs. You may want to truncate repair_run_by_unit and repair_run_by_cluster as well.
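As a rough illustration, the three truncations can be generated as CQL statements like this. The keyspace name reaper_db and the helper name are assumptions, not something stated in this thread, and truncating discards all repair history in these tables.

```python
from typing import List

# Tables named in the comment above.
REPAIR_TABLES = ("repair_run", "repair_run_by_unit", "repair_run_by_cluster")


def build_truncate_statements(keyspace: str = "reaper_db") -> List[str]:
    """Return one CQL TRUNCATE statement per repair table (keyspace is assumed)."""
    return [f"TRUNCATE {keyspace}.{table};" for table in REPAIR_TABLES]


if __name__ == "__main__":
    for statement in build_truncate_statements():
        print(statement)
```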

adejanovski commented 2 years ago

This was fixed a while ago. Closing this ticket.

svalvaikar commented 1 year ago

We saw this issue when upgrading from version 2.3.1 to 3.2.0. Ours is a two-datacenter (DC) setup that needs at least one DC to be available at all times for HA, so we performed a rolling upgrade: DC2 stayed operational, with its Reaper instance running and accessing the Cassandra data store, while DC1 was being upgraded. I suspect that having two different versions of Reaper talking to the same data store caused this issue.

We fixed the issue by using the workaround you suggested here (https://github.com/thelastpickle/cassandra-reaper/issues/965#issuecomment-700043507). We then tried the upgrade again, this time making sure that all Reaper instances were stopped while the upgrade was in progress. We haven't seen the issue since.

Since a rolling upgrade may not be a supported scenario for Reaper, we updated our upgrade documents to stop all Reaper instances first (which is fine for us, unlike Cassandra, which needs to stay up for HA).