Open gaborauth opened 1 year ago
Hi @gaborauth,
are these repair runs created by schedules? Is the adaptive feature enabled on these schedules? Could you check the segment timeout on the schedule?
The number of segments or the segment timeout can be adjusted if the adaptive feature is enabled, but aside from a sad, sad bug that would allow the timeout to go into negative values, I'm a little puzzled by this behavior.
Sometimes the affected nodes fail too and are restarted by Kubernetes.
Ouch, that's very weird as well 🤔 Anything else in the logs of these nodes when they are restarted?
are these repair runs created by schedules? Is the adaptive feature enabled on these schedules?
Yes and yes.
Could you check the segment timeout on the schedule?
The parameters of the schedule:
Next run | 20 January 2023 08:39
Owner | auto-scheduling
Incremental | false
Segment count per node | 19
Intensity | 0.8999999761581421
Repair threads | 1
Segment timeout (mins) | -268435456
Repair parallelism | DATACENTER_AWARE
Creation time | 6 November 2022 21:36
Adaptive | true
Anything else in the logs of these nodes when they are restarted?
I have no logs from that period (I paused it in late December). BTW: the repair process creates snapshots per repair session and then leaves them hanging, so maybe this bunch of snapshots caused some resource shortage and led to the restarts; I had to delete them manually.
ok, this doesn't seem to be due to the adaptive nature of the schedule. The only things an adaptive schedule can get adjusted are the number of segments and the segment timeout. We're never lowering the timeout.
I've checked the autoscheduler code, but it just uses the default timeout.
Could you check the timeout configured for the other schedules? Could you modify the timeout for this schedule and monitor if it gets lowered by Reaper itself?
Could you check the timeout configured for the other schedules?
I've just checked some:
Segment timeout (mins) | 491520
Segment timeout (mins) | 62914560
Segment timeout (mins) | -268435456
Segment timeout (mins) | 960
Segment timeout (mins) | 240
Segment timeout (mins) | 15360
Segment timeout (mins) | 491520
Segment timeout (mins) | 7680
Segment timeout (mins) | 62914560
Segment timeout (mins) | 61440
Could you modify the timeout for this schedule and monitor if it gets lowered by Reaper itself?
Where can I modify it? It doesn't appear in the edit dialog. Is it safe to modify it directly in the cassandra_reaper.repair_unit_v1 table (timeout field)?
You should have access to the timeout under the Advanced settings section of the edit dialog. Click on "Advanced settings" to make that appear.
actually you're right, I'm not seeing it either 😓 We need to make it accessible. Changing the value in the table directly is safe.
Changing the value in the table directly is safe.
Okay, I've just updated the timeout of all repair units to '60', had to restart Reaper and... it does the magic, it looks like it's working. :)
I will check the timeout values regularly.
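For reference, the one-off fix can be sketched roughly like this (not an official Reaper procedure). The table and column names (cassandra_reaper.repair_unit_v1, timeout, id) come from this thread; the UUID type of the id column and the default driver configuration are assumptions on my part, and Reaper still needs a restart afterwards so it picks up the new values:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ResetRepairUnitTimeouts {
    public static void main(String[] args) {
        // Assumes a locally reachable Cassandra backend and default driver configuration.
        try (CqlSession session = CqlSession.builder().build()) {
            for (Row row : session.execute(
                    "SELECT id, timeout FROM cassandra_reaper.repair_unit_v1")) {
                int timeout = row.getInt("timeout");
                // Negative values are the overflowed ones; reset anything unreasonable to 60 minutes.
                if (timeout <= 0 || timeout > 60) {
                    session.execute(SimpleStatement.newInstance(
                            "UPDATE cassandra_reaper.repair_unit_v1 SET timeout = 60 WHERE id = ?",
                            row.getUuid("id")));
                }
            }
        }
    }
}
```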
We're never lowering the timeout.
Is increasing the timeout overflow-safe?
Is increasing the timeout overflow-safe?
Looking at the numbers you have, I think we should put a max threshold over which Reaper cannot increase it anymore.
I tried the doubling, and the -268435456 is definitely an overflow of the 32-bit-wide integer type, so a reasonable threshold will solve it:
60
120
240
480
960
1920
3840
7680
15360
30720
61440
122880
245760
491520
983040
1966080
3932160
7864320
15728640
31457280
62914560
125829120
251658240
503316480
1006632960
2013265920
-268435456
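For what it's worth, here is a minimal sketch (not Reaper's actual adjustment code) of why the doubling sequence above wraps around to -268435456: the timeout is a 32-bit signed integer, and doubling it past Integer.MAX_VALUE overflows. A cap such as the threshold suggested above would prevent the wrap-around; the cap value below is just a hypothetical example:

```java
public class TimeoutOverflowDemo {
    // Hypothetical cap: anything well below Integer.MAX_VALUE / 2 prevents the overflow.
    static final int MAX_SEGMENT_TIMEOUT_MINS = 1440;

    // Unchecked doubling, as the sequence above suggests happens today.
    static int doubleUnchecked(int timeoutMins) {
        return timeoutMins * 2;
    }

    // Doubling with a maximum threshold: the value can never grow large enough to overflow.
    static int doubleCapped(int timeoutMins) {
        return Math.min(timeoutMins * 2, MAX_SEGMENT_TIMEOUT_MINS);
    }

    public static void main(String[] args) {
        System.out.println(doubleUnchecked(2013265920)); // prints -268435456 (32-bit wrap-around)
        System.out.println(doubleCapped(960));           // prints 1440 and stays there
    }
}
```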
I experienced a weird issue with one of my clusters, affecting only one keyspace. In this case, all repair segments fail and the fail count is incrementing constantly. Sometimes the affected nodes fail too and are restarted by Kubernetes.
--
On the Cassandra side, I see exceptions like this:
The CLI repair works flawlessly on the same keyspace (it's super-fast because it's in one DC and has only a few ranges and little data, so I run the CLI repair from crontab):
So I don't know which side has the issue, but it doesn't look like a Cassandra issue.
Also, for info: I checked the repairs, and the failing repairs have a negative segment timeout, while the other repairs have a positive segment timeout:
Some other information:
Issue is synchronized with this Jira Story by Unito. Issue Number: REAP-74