thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0
Repair stopped with error Segment is faulty, replica set changed since repair was started #1220

Closed MrEgenius closed 1 year ago

MrEgenius commented 2 years ago

Hello!

Reaper version: 2.2.4
Backend: H2
Cassandra version: 3.11.3

We have a cluster with 2 active racks and 1 backup rack, 82 nodes per rack (246 nodes in total), holding about 100 TB per rack (300 TB in total).

The repair cannot complete because of this error:

cassandra_reaper | WARN [2022-08-17 13:40:27,980] [prodcluster:00000000-0000-01a4-0000-000000000000] i.c.s.RepairRunner - Segment #00000000-0053-9d13-0000-000000000000 is faulty, replica set changed since repair was started: io.cassandrareaper.core.Segment@3da7838a

Restarting Reaper has no effect. Please help us solve this problem.
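For readers unfamiliar with the message: the warning text suggests that the replicas currently owning the segment's token range no longer match the replicas recorded when the repair run started. The following is a minimal sketch of that kind of comparison, not Reaper's actual code. The `Segment` class, the `replica_set_changed` function, and the IP addresses are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a repair segment; not Reaper's own class."""
    segment_id: str
    replicas_at_start: frozenset  # replica hosts recorded when the run was created


def replica_set_changed(segment, current_replicas):
    """Return True when the live replicas differ from those recorded at start."""
    # Compare as sets so that host ordering does not matter.
    return frozenset(current_replicas) != segment.replicas_at_start


# Illustrative data only: three replicas recorded at start.
segment = Segment("seg-1", frozenset({"10.0.0.1", "10.0.0.2", "10.0.0.3"}))

# Same hosts in a different order: no change detected.
unchanged = replica_set_changed(segment, ["10.0.0.3", "10.0.0.2", "10.0.0.1"])

# One host replaced: the segment would be reported as changed.
changed = replica_set_changed(segment, ["10.0.0.1", "10.0.0.2", "10.0.0.4"])
```

Under this assumption, a restart would not help if the recorded replica set is persisted with the run, which is consistent with what is reported above.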

yvester commented 2 years ago

Hello, I am getting the same type of error: Segment #1f3c64e6-2379-11ed-ac16-854a9b7a5a6c is faulty, replica set changed since repair was started: io.cassandrareaper.core.Segment@7392f023

Full excerpt of the log:

1182274881342157,-374700411561949693] for keyspace "adaptateurtrieurs" on host xxx.xxx;xxx, with repair parallelism dc_parallel, in cluster with Cassandra version '3.11.4' (can use DATACENTER_AWARE 'true'), for column families: [evt_donnees_tri_v1, evt_douane_priority_v1, evtann, evtpch, evtpch_v2]
INFO [2022-08-24 10:22:40,565] [ar-cassandra3-preproduction:1f30a4d0-2379-11ed-ac16-854a9b7a5a6c:1f5c21d1-2379-11ed-ac16-854a9b7a5a6c] i.c.j.JmxProxy - Triggering repair for ranges -381182274881342157:-374700411561949693
INFO [2022-08-24 10:22:40,570] [ar-cassandra3-preproduction:1f30a4d0-2379-11ed-ac16-854a9b7a5a6c:1f5c21d1-2379-11ed-ac16-854a9b7a5a6c] i.c.s.RepairRunner - Triggered repair of segment 1f5c21d1-2379-11ed-ac16-854a9b7a5a6c via host 10.155.4.255
INFO [2022-08-24 10:22:40,570] [ar-cassandra3-preproduction:1f30a4d0-2379-11ed-ac16-854a9b7a5a6c:1f5c21d1-2379-11ed-ac16-854a9b7a5a6c] i.c.s.SegmentRunner - Repair for segment 1f5c21d1-2379-11ed-ac16-854a9b7a5a6c started, status wait will timeout in 1800000 millis
INFO [2022-08-24 10:22:46,976] [ar-cassandra3-preproduction:1f30a4d0-2379-11ed-ac16-854a9b7a5a6c:1f5c21d1-2379-11ed-ac16-854a9b7a5a6c] i.c.s.SegmentRunner - Repair command 73 on segment 1f5c21d1-2379-11ed-ac16-854a9b7a5a6c returned with state DONE
INFO [2022-08-24 10:23:10,570] [ar-cassandra3-preproduction:1f30a4d0-2379-11ed-ac16-854a9b7a5a6c] i.c.s.RepairRunner - Attempting to run new segment...
INFO [2022-08-24 10:23:11,255] [ar-cassandra3-preproduction:1f30a4d0-2379-11ed-ac16-854a9b7a5a6c] i.c.s.RepairRunner - Next segment to run : 1f3c64e6-2379-11ed-ac16-854a9b7a5a6c
WARN [2022-08-24 10:23:11,265] [ar-cassandra3-preproduction:1f30a4d0-2379-11ed-ac16-854a9b7a5a6c] i.c.s.RepairRunner - Segment #1f3c64e6-2379-11ed-ac16-854a9b7a5a6c is faulty, replica set changed since repair was started: io.cassandrareaper.core.Segment@7392f023
INFO [2022-08-24 10:23:11,369] [ar-cassandra3-preproduction:1f30a4d0-2379-11ed-ac16-854a9b7a5a6c] i.c.s.RepairRunner - Repair amount done 86.0

Please let us know if there is a known explanation to this issue
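When many segments are affected, it can help to list them straight from the Reaper log. Here is a small sketch that extracts the timestamp and segment ID from each "is faulty, replica set changed" warning. The regular expression is modelled only on the WARN lines quoted in this thread, so other Reaper versions may format them differently, and `faulty_segments` is a hypothetical helper name.

```python
import re

# Modelled on the WARN lines quoted in this thread (an assumption about the format).
FAULTY = re.compile(
    r"WARN\s+\[(?P<ts>[^\]]+)\]\s+\[(?P<ctx>[^\]]+)\].*?"
    r"Segment #(?P<segment>[0-9a-f-]+) is faulty, replica set changed"
)


def faulty_segments(log_text):
    """Return (timestamp, segment_id) pairs for each matching warning."""
    return [(m["ts"], m["segment"]) for m in FAULTY.finditer(log_text)]


# Sample input: two lines copied from the excerpt above.
sample = (
    "INFO [2022-08-24 10:23:11,255] [ar-cassandra3-preproduction:"
    "1f30a4d0-2379-11ed-ac16-854a9b7a5a6c] i.c.s.RepairRunner - "
    "Next segment to run : 1f3c64e6-2379-11ed-ac16-854a9b7a5a6c\n"
    "WARN [2022-08-24 10:23:11,265] [ar-cassandra3-preproduction:"
    "1f30a4d0-2379-11ed-ac16-854a9b7a5a6c] i.c.s.RepairRunner - "
    "Segment #1f3c64e6-2379-11ed-ac16-854a9b7a5a6c is faulty, replica set "
    "changed since repair was started: io.cassandrareaper.core.Segment@7392f023\n"
)

result = faulty_segments(sample)
# result == [("2022-08-24 10:23:11,265", "1f3c64e6-2379-11ed-ac16-854a9b7a5a6c")]
```

The pattern expects one log entry per line; if entries are pasted together on a single line, the lazy match may span more than one entry.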

yvester commented 2 years ago

Hello,

Looking at the code, if I am not mistaken, this exception appears to be handled differently in Reaper version 3.2.0: https://github.com/thelastpickle/cassandra-reaper/commit/ac58595266ea426bc52c7abef969fb5a189b72a7?diff=split

I am going to retest with this new version.

Regards,

MrEgenius commented 2 years ago

@yvester

Have you tested version 3.2.0? Does the bug reproduce on that version?

yvester commented 2 years ago

@MrEgenius, this issue is no longer present with the current version. Regards