[Open] lapo-luchini opened this issue 1 year ago
Since the latest versions, repairs are now adaptive, which means a segment's timeout gets extended on each retry, in the hope that it will eventually pass. Maybe you have segments with a high failure count?
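For illustration, here is a minimal sketch of what "adaptive" could mean here, assuming a linear growth policy; the class and method names are hypothetical, not Reaper's actual implementation:

```java
// Illustrative sketch of an adaptive segment timeout, NOT Reaper's code.
// The idea: each failed attempt stretches the timeout, so a segment that
// keeps failing gets progressively more time on the next try.
final class AdaptiveTimeout {
    private final long baseTimeoutMins; // e.g. hangingRepairTimeoutMins = 120

    AdaptiveTimeout(long baseTimeoutMins) {
        this.baseTimeoutMins = baseTimeoutMins;
    }

    /** Timeout for the next attempt, grown linearly with the failure count (assumed policy). */
    long timeoutMinsFor(int failCount) {
        return baseTimeoutMins * (failCount + 1);
    }

    public static void main(String[] args) {
        AdaptiveTimeout t = new AdaptiveTimeout(120);
        for (int fails = 0; fails <= 5; fails++) {
            System.out.printf("after %d failures -> %d mins%n", fails, t.timeoutMinsFor(fails));
        }
    }
}
```

Under an assumption like this, a segment with many failures could legitimately run far past the configured base timeout.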
Ah, yes! Didn't notice, thanks. Previous rows in the log were indeed:
i.c.s.SegmentRunner - repair session failed for segment with id '271ddf57-de81-11ed-a6d8-41233aeced5a' and repair number '1023'
i.c.s.SegmentRunner - repair session failed for segment with id '271ddf57-de81-11ed-a6d8-41233aeced5a' and repair number '5623'
i.c.s.SegmentRunner - repair session failed for segment with id '271ddf57-de81-11ed-a6d8-41233aeced5a' and repair number '5624'
i.c.s.SegmentRunner - repair session failed for segment with id '271ddf57-de81-11ed-a6d8-41233aeced5a' and repair number '2043'
i.c.s.SegmentRunner - repair session failed for segment with id '271ddf57-de81-11ed-a6d8-41233aeced5a' and repair number '5625'
i.c.s.SegmentRunner - Failed to connect to a coordinator node for segment 271ddf57-de81-11ed-a6d8-41233aeced5a
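To see which segments accumulate failures (and would therefore get extended timeouts), the log can be tallied directly. A small self-contained sketch that matches the log format above; the log file path is passed as an argument:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts "repair session failed" occurrences per segment id in a Reaper log,
// based on the log lines quoted above.
public class SegmentFailureTally {
    private static final Pattern FAILED = Pattern.compile(
        "repair session failed for segment with id '([0-9a-f-]+)'");

    public static void main(String[] args) throws IOException {
        Map<String, Integer> failures = new HashMap<>();
        for (String line : Files.readAllLines(Path.of(args[0]))) {
            Matcher m = FAILED.matcher(line);
            if (m.find()) {
                failures.merge(m.group(1), 1, Integer::sum);
            }
        }
        failures.forEach((segment, count) ->
            System.out.printf("%s: %d failures%n", segment, count));
    }
}
```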
Is it possible to check the current expiration time?
Mhh, I was a bit quick to close the issue… the UUIDs don't match, and the stuck segment has zero failures:
(it's now 15:47, so over 3 hours after that segment started)
repairs are now adaptive
Over a single segment, or overall? (I have failures on other segments, but not on that one.)
Rarely, but sometimes, I get segments that hang for much longer than the hangingRepairTimeoutMins I configured (e.g. 3 days vs a configured value of "120", i.e. 2 hours). I can return to normality by aborting those segments manually (and/or sometimes by killing and restarting Reaper), but I wonder whether it's only me and/or I misunderstood something about that parameter.
Could this be connected to this log?
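For reference, the general technique behind a hanging-repair timeout is a watchdog that aborts a segment unless it finishes first. A minimal sketch of that idea only, not Reaper's implementation; abortSegment is a hypothetical callback standing in for whatever cleanup the orchestrator actually performs:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative watchdog for a hanging repair segment (a sketch of the
// general technique, not Reaper's code).
public class HangWatchdog {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    /** Schedules an abort unless cancel() is called on the result before the timeout fires. */
    ScheduledFuture<?> arm(long timeoutMins, Runnable abortSegment) {
        return scheduler.schedule(abortSegment, timeoutMins, TimeUnit.MINUTES);
    }

    public static void main(String[] args) {
        HangWatchdog watchdog = new HangWatchdog();
        // With hangingRepairTimeoutMins = 120, an un-cancelled segment should
        // be aborted after ~2 hours. If the timeout is extended adaptively, or
        // the watchdog is never re-armed after a reconnect, a segment could
        // outlive the configured value by a wide margin.
        ScheduledFuture<?> pending = watchdog.arm(120, () ->
            System.out.println("segment aborted after timeout"));
        // A segment that completes in time cancels its watchdog:
        pending.cancel(false);
        watchdog.scheduler.shutdown();
    }
}
```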