thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Duplicate repairs getting scheduled. #591

Closed: powellchristoph closed this issue 5 years ago

powellchristoph commented 6 years ago

Greetings everyone,

Thank you for the great work on this project. It is greatly appreciated.

Somehow I seem to be getting duplicate repairs scheduled and run for the same keyspace. This is obviously creating contention, as the two repairs block each other. Also, once they start, I am unable to remove either one of them. This is the second time it has happened, on two different keyspaces.

I am running two Reapers in a multi-region setup with a replicated Cassandra backend.

cassandra-reaper version: 1.3.0
java version: "1.8.0_181"
OS: Ubuntu 16.04.5 LTS

curl -s localhost:8080/repair_schedule/ca773da0-b5f4-11e8-bbac-5fdbb7ebf105 | jq .
{
  "id": "ca773da0-b5f4-11e8-bbac-5fdbb7ebf105",
  "owner": "blackbook",
  "state": "ACTIVE",
  "intensity": 0.8999999761581421,
  "cluster_name": "aux1",
  "keyspace_name": "blackbook",
  "column_families": [],
  "incremental_repair": false,
  "segment_count": 0,
  "repair_parallelism": "DATACENTER_AWARE",
  "scheduled_days_between": 8,
  "nodes": [],
  "datacenters": [],
  "blacklisted_tables": [],
  "segment_count_per_node": 0,
  "repair_thread_count": 1,
  "next_activation": "2018-12-08T18:50:00Z",
  "creation_time": "2018-09-11T18:59:20Z",
  "pause_time": "2018-11-21T16:57:51Z"
}
curl -s localhost:8080/repair_run | jq '.[] | select( .state | contains("RUNNING")) | select( .owner | contains("blackbook"))'
{
  "cause": "scheduled run (schedule id ca773da0-b5f4-11e8-bbac-5fdbb7ebf105)",
  "owner": "blackbook",
  "id": "d191d350-f4d0-11e8-bd85-e111dadad660",
  "state": "RUNNING",
  "intensity": 0.8999999761581421,
  "duration": "2 hours 24 minutes 49 seconds",
  "cluster_name": "aux1",
  "column_families": [],
  "keyspace_name": "blackbook",
  "incremental_repair": false,
  "total_segments": 10752,
  "repair_parallelism": "DATACENTER_AWARE",
  "segments_repaired": 219,
  "last_event": "Postponed due to affected hosts already doing repairs",
  "nodes": [],
  "datacenters": [],
  "blacklisted_tables": [],
  "repair_thread_count": 1,
  "creation_time": "2018-11-30T18:50:34Z",
  "start_time": "2018-11-30T18:50:34Z",
  "end_time": "2018-11-30T18:52:34Z",
  "pause_time": "2018-11-30T18:52:34Z",
  "current_time": "2018-11-30T21:15:23Z",
  "estimated_time_of_arrival": "2018-12-05T17:20:41Z"
}
{
  "cause": "scheduled run (schedule id ca773da0-b5f4-11e8-bbac-5fdbb7ebf105)",
  "owner": "blackbook",
  "id": "d22dec40-f4d0-11e8-a415-b5e61892c678",
  "state": "RUNNING",
  "intensity": 0.8999999761581421,
  "duration": "2 hours 24 minutes 48 seconds",
  "cluster_name": "aux1",
  "column_families": [],
  "keyspace_name": "blackbook",
  "incremental_repair": false,
  "total_segments": 10752,
  "repair_parallelism": "DATACENTER_AWARE",
  "segments_repaired": 202,
  "last_event": "Triggered repair of segment d22f733c-f4d0-11e8-a415-b5e61892c678 via host 34.216.185.148",
  "nodes": [],
  "datacenters": [],
  "blacklisted_tables": [],
  "repair_thread_count": 1,
  "creation_time": "2018-11-30T18:50:35Z",
  "start_time": "2018-11-30T18:50:35Z",
  "end_time": "2018-11-30T18:50:35Z",
  "pause_time": "2018-11-30T18:50:35Z",
  "current_time": "2018-11-30T21:15:24Z",
  "estimated_time_of_arrival": "2018-12-06T03:18:20Z"
}
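
Filtering the same output down to just the ids, creation times, and causes (a variant of the jq filter above) makes it obvious that both runs came from the same schedule, created one second apart:

curl -s localhost:8080/repair_run | jq '.[] | select(.owner == "blackbook" and .state == "RUNNING") | {id, creation_time, cause}'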
Nov 30 21:16:13 cassandra-reaper-n01 cassandra-reaper[1496]: INFO   [2018-11-30 21:16:13,516] [dw-151148 - PUT /repair_run/d191d350-f4d0-11e8-bd85-e111dadad660/state/PAUSED] i.c.r.RepairRunResource - modify repair run state called with: id = d191d350-f4d0-11e8-bd85-e111dadad660, state = Optional[PAUSED]
Nov 30 21:16:13 cassandra-reaper-n01 cassandra-reaper[1496]: ERROR  [2018-11-30 21:16:13,757] [dw-151148 - PUT /repair_run/d191d350-f4d0-11e8-bd85-e111dadad660/state/PAUSED] i.c.r.RepairRunResource - repair unit already has run ca63dcb0-b5f4-11e8-bbac-5fdbb7ebf105 in RUNNING state

Nov 30 21:18:06 cassandra-reaper-n01 cassandra-reaper[1496]: INFO   [2018-11-30 21:18:06,717] [dw-151098 - PUT /repair_run/d191d350-f4d0-11e8-bd85-e111dadad660/state/ABORTED] i.c.r.RepairRunResource - modify repair run state called with: id = d191d350-f4d0-11e8-bd85-e111dadad660, state = Optional[ABORTED]
Nov 30 21:18:07 cassandra-reaper-n01 cassandra-reaper[1496]: ERROR  [2018-11-30 21:18:06,999] [dw-151098 - PUT /repair_run/d191d350-f4d0-11e8-bd85-e111dadad660/state/ABORTED] i.c.r.RepairRunResource - repair unit already has run ca63dcb0-b5f4-11e8-bbac-5fdbb7ebf105 in RUNNING state
Nov 30 21:16:43 cassandra-reaper-n01 cassandra-reaper[1496]: INFO   [2018-11-30 21:16:43,689] [dw-151121 - PUT /repair_run/d22dec40-f4d0-11e8-a415-b5e61892c678/state/PAUSED] i.c.r.RepairRunResource - modify repair run state called with: id = d22dec40-f4d0-11e8-a415-b5e61892c678, state = Optional[PAUSED]
Nov 30 21:16:44 cassandra-reaper-n01 cassandra-reaper[1496]: ERROR  [2018-11-30 21:16:44,016] [dw-151121 - PUT /repair_run/d22dec40-f4d0-11e8-a415-b5e61892c678/state/PAUSED] i.c.r.RepairRunResource - repair unit already has run ca63dcb0-b5f4-11e8-bbac-5fdbb7ebf105 in RUNNING state

Nov 30 21:17:58 cassandra-reaper-n01 cassandra-reaper[1496]: INFO   [2018-11-30 21:17:58,947] [dw-151162 - PUT /repair_run/d22dec40-f4d0-11e8-a415-b5e61892c678/state/ABORTED] i.c.r.RepairRunResource - modify repair run state called with: id = d22dec40-f4d0-11e8-a415-b5e61892c678, state = Optional[ABORTED]
Nov 30 21:17:59 cassandra-reaper-n01 cassandra-reaper[1496]: ERROR  [2018-11-30 21:17:59,225] [dw-151162 - PUT /repair_run/d22dec40-f4d0-11e8-a415-b5e61892c678/state/ABORTED] i.c.r.RepairRunResource - repair unit already has run ca63dcb0-b5f4-11e8-bbac-5fdbb7ebf105 in RUNNING state
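
For reference, this is roughly the sequence I was attempting through the REST API, which fails with the errors above (the owner query parameter on the DELETE is my reading of the docs, so treat this as a sketch):

# pause the duplicate run first (this is the PUT that errors out in the logs)
curl -X PUT localhost:8080/repair_run/d191d350-f4d0-11e8-bd85-e111dadad660/state/PAUSED
# a run can only be deleted once it is no longer running
curl -X DELETE "localhost:8080/repair_run/d191d350-f4d0-11e8-bd85-e111dadad660?owner=blackbook"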

cassandra-reaper.yaml from one region

# Cassandra Reaper Configuration 
segmentCountPerNode: 16
repairParallelism: DATACENTER_AWARE
repairIntensity: 0.9
scheduleDaysBetween: 1
repairRunThreadCount: 15
hangingRepairTimeoutMins: 300
storageType: cassandra
enableCrossOrigin: true
incrementalRepair: false
enableDynamicSeedList: true
repairManagerSchedulingIntervalSeconds: 10
activateQueryLogger: false
jmxConnectionTimeoutInSeconds: 5
useAddressTranslator: false

datacenterAvailability: EACH

cassandra:
  clusterName: "Aux 1"
  contactPoints: ["cassandra-n01","cassandra-n02","cassandra-n03"]
  keyspace: reaper_db
  socketOptions:
    readTimeoutMillis: 20000
  queryOptions:
    consistencyLevel: LOCAL_QUORUM
    serialConsistencyLevel: SERIAL
  loadBalancingPolicy:
    type: tokenAware
    shuffleReplicas: true
    subPolicy:
      type: dcAwareRoundRobin
      localDC: us-east-1
      usedHostsPerRemoteDC: 0
      allowRemoteDCsForLocalConsistencyLevel: false

The reaper_db keyspace is replicated appropriately.

cqlsh> desc keyspace reaper_db;

CREATE KEYSPACE reaper_db WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east-1': '3', 'us-west-2': '3'}  AND durable_writes = true;
michaelsembwever commented 5 years ago

Thanks for a great bug report. I can see two issues here.

  1. Two repairs created and started from the one schedule, but a second apart.
  2. Not being able to delete the duplicate.

Let me look into it a bit more, @powellchristoph. (Note that Reaper's distributed mode is designed to be at-least-once; the repairs running twice are OK apart from the extra, unnecessary load they create…)

michaelsembwever commented 5 years ago

@powellchristoph could you provide the log files from both Reapers around the time 2018-11-30T18:50:34Z? I'm specifically curious about the logging from the SchedulingManager.
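
Something along these lines should pull out the relevant entries (the log path is just an example, adjust it to your install):

grep 'SchedulingManager' /var/log/cassandra-reaper/reaper.log | grep '2018-11-30 18:50'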

powellchristoph commented 5 years ago

I am sorry. I don't have the logs going that far back.

michaelsembwever commented 5 years ago

I have a fix in https://github.com/thelastpickle/cassandra-reaper/pull/593. It would be awesome if you could deploy and test it. Otherwise, if it happens again, those requested logs would be very useful.

feroz552 commented 4 years ago

We encountered this problem today and, as mentioned, it doesn't seem to be fixed in Reaper version 1.4.0.

We have a scheduled repair run. From the attached snippets, you can see we had duplicate jobs repairing the same keyspace: one succeeded and the other just hung there until we noticed a major consistency issue (after 4 weeks), caused by the subsequent scheduled jobs failing because of this hung job.

reaper-2020-09-03.log:INFO [2020-09-03 14:45:51,783] [SchedulingManagerTimer] i.c.s.SchedulingManager - there is repair (id b655ae40-e6e1-11ea-abbe-dbde9584707b) in state 'NOT_STARTED' for repair unit '40dda9c0-3444-11e9-960d-297ac0e3ef4a', postponing current schedule trigger until next scheduling

reaper-2020-09-12.log:INFO [2020-09-12 14:45:39,041] [SchedulingManagerTimer] i.c.s.SchedulingManager - there is repair (id b655ae40-e6e1-11ea-abbe-dbde9584707b) in state 'NOT_STARTED' for repair unit '40dda9c0-3444-11e9-960d-297ac0e3ef4a', postponing current schedule trigger until next scheduling
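
As a stopgap we are considering aborting the stuck NOT_STARTED run through the same state-change endpoint shown in the logs earlier in this thread, so the schedule can fire again (untested on our side, a sketch only):

curl -X PUT localhost:8080/repair_run/b655ae40-e6e1-11ea-abbe-dbde9584707b/state/ABORTED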

[two image attachments]