Closed powellchristoph closed 5 years ago
Thanks for a great bug report. I can see two issues here.
Let me look into it a bit more @powellchristoph (Note that Reaper's distributed mode is designed as at-least-once, and the repairs running twice are ok apart from the extra and unnecessary load they're creating…)
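To illustrate how an at-least-once trigger can produce two runs for one repair unit when two instances share a backend, here is a minimal sketch. It is not Reaper's actual code: `RunStore`, `trigger_unguarded`, `trigger_guarded`, and the instance names are hypothetical. In a real Cassandra backend the `claim` step would need an atomic conditional write (for example a lightweight transaction such as `INSERT ... IF NOT EXISTS`), which the dictionary below only imitates.

```python
class RunStore:
    """Hypothetical stand-in for a shared backend table of repair runs."""

    def __init__(self):
        self.runs = []    # list of (unit_id, owner) tuples
        self.claims = {}  # unit_id -> owner, used only by the guarded variant

    def has_active_run(self, unit_id):
        return any(u == unit_id for u, _ in self.runs)

    def add_run(self, unit_id, owner):
        self.runs.append((unit_id, owner))

    def claim(self, unit_id, owner):
        """Insert-if-absent: returns True only for the first claimant."""
        return self.claims.setdefault(unit_id, owner) == owner


def trigger_unguarded(store, unit_id, instances):
    # Every instance checks before any instance writes (check-then-act race),
    # so each one believes no run exists and creates its own.
    checks = {name: store.has_active_run(unit_id) for name in instances}
    for name in instances:
        if not checks[name]:
            store.add_run(unit_id, name)
    return len(store.runs)


def trigger_guarded(store, unit_id, instances):
    # Only the instance that wins the conditional claim creates the run.
    for name in instances:
        if store.claim(unit_id, name):
            store.add_run(unit_id, name)
    return len(store.runs)


if __name__ == "__main__":
    print(trigger_unguarded(RunStore(), "unit-1", ["reaper-a", "reaper-b"]))
    print(trigger_guarded(RunStore(), "unit-1", ["reaper-a", "reaper-b"]))
```

The unguarded variant ends up with one run per instance, while the guarded variant ends up with a single run, which is the behaviour one would want from a schedule trigger that may fire on more than one instance.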
@powellchristoph could you provide the log files from both Reaper instances around 2018-11-30T18:50:34Z?
I'm specifically curious about the logging from the SchedulingManager.
I am sorry. I don't have the logs going that far back.
I have a fix in https://github.com/thelastpickle/cassandra-reaper/pull/593 . It would be awesome if you could deploy and test it. Otherwise, if it happens again, the requested logs would be very useful.
We encountered this problem today and, as mentioned, it doesn't seem to be fixed in Reaper version 1.4.0.
We have a scheduled repair run. From the attached snippets, you can see that we had duplicate jobs running a repair for the same keyspace. One succeeded and the other just hung there until we noticed a major consistency issue (after 4 weeks), caused by the subsequent scheduled jobs failing because of the hung job.
reaper-2020-09-03.log:INFO [2020-09-03 14:45:51,783] [SchedulingManagerTimer] i.c.s.SchedulingManager - there is repair (id b655ae40-e6e1-11ea-abbe-dbde9584707b) in state 'NOT_STARTED' for repair unit '40dda9c0-3444-11e9-960d-297ac0e3ef4a', postponing current schedule trigger until next scheduling
reaper-2020-09-12.log:INFO [2020-09-12 14:45:39,041] [SchedulingManagerTimer] i.c.s.SchedulingManager - there is repair (id b655ae40-e6e1-11ea-abbe-dbde9584707b) in state 'NOT_STARTED' for repair unit '40dda9c0-3444-11e9-960d-297ac0e3ef4a', postponing current schedule trigger until next scheduling
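For anyone wanting to spot a stuck run like this earlier, a rough sketch of scanning the Reaper logs for the postponement message follows. The regular expression is written against the two log lines quoted above only; `find_blocking_runs` and its return shape are hypothetical helpers, not part of Reaper.

```python
import re
from datetime import datetime

# Pattern derived from the two SchedulingManager lines quoted above.
POSTPONE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\].*"
    r"there is repair \(id (?P<run>[0-9a-f-]+)\) in state '(?P<state>\w+)' "
    r"for repair unit '(?P<unit>[0-9a-f-]+)', postponing"
)


def find_blocking_runs(lines):
    """Group postponement messages by the id of the repair run blocking the schedule."""
    seen = {}
    for line in lines:
        m = POSTPONE_RE.search(line)
        if not m:
            continue  # not a postponement message
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        entry = seen.setdefault(
            m["run"],
            {"state": m["state"], "unit": m["unit"], "first": ts, "last": ts, "count": 0},
        )
        entry["first"] = min(entry["first"], ts)
        entry["last"] = max(entry["last"], ts)
        entry["state"] = m["state"]
        entry["count"] += 1
    return seen


if __name__ == "__main__":
    import sys

    # Usage (assumed): grep-style output or raw log lines piped on stdin.
    for run_id, info in find_blocking_runs(sys.stdin).items():
        blocked_for = info["last"] - info["first"]
        print(run_id, info["state"], info["count"], "postponements over", blocked_for)
```

A run id that keeps reappearing in state 'NOT_STARTED' over several days, as in the two lines above, is a sign that the schedule is being blocked by a run that will never progress.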
Greetings everyone,
Thank you for the great work on this project. It is greatly appreciated.
Somehow I seem to be getting duplicate repairs scheduled and run for the same keyspace. This is obviously creating contention, as both repairs block each other. Also, once they start, I am unable to remove either one of them. This is the second time it has happened, each time on a different keyspace.
I am running two Reaper instances in a multi-region setup with a replicated Cassandra backend.
cassandra-reaper version: 1.3.0
java version "1.8.0_181"
OS: Ubuntu 16.04.5 LTS
cassandra-reaper.yaml from one region
The reaper_db keyspace is replicated appropriately.