Open shalomsagges opened 7 years ago
Hi @shalomsagges,
are you still affected by this issue?
Hi Alex,
Apologies for the late reply, just saw your message.
Yes, I do experience this issue from time to time. I got the same error a couple of weeks ago. Here's the error (I've included some INFO lines logged just before it):
INFO [2019-03-27 16:55:12,402] [apac-new-cluster:454bf7b0-5090-11e9-8f7c-ad64bf35605e] i.c.s.RepairRunner
INFO [2019-03-28 02:01:03,383] [SchedulingManagerTimer] i.c.s.RepairManager - scheduling repair for repair run
INFO [2019-03-28 02:01:04,518] [pool-1-thread-1] i.c.s.RepairManager - Restarting run id de5e46e0-511e-11e9-8f7c-ad64bf35605e that has no runner
INFO [2019-03-28 02:01:04,518] [pool-1-thread-1] i.c.s.RepairManager - Starting a run with id #de5e46e0-511e-11e9-8f7c-ad64bf35605e with current state 'RUNNING'
INFO [2019-03-28 02:01:04,518] [pool-1-thread-1] i.c.s.RepairManager - re-trigger a running run after restart, with id de5e46e0-511e-11e9-8f7c-ad64bf35605e
INFO [2019-03-28 02:01:04,518] [pool-1-thread-1] i.c.s.RepairManager - scheduling repair for repair run #de5e46e0-511e-11e9-8f7c-ad64bf35605e
INFO [2019-03-28 02:01:08,748] [csds-cluster-ga:de5e46e0-511e-11e9-8f7c-ad64bf35605e] i.c.s.RepairRunner - Running segment for range (-9222807929782021608,-9222807929782021608]
INFO [2019-03-28 02:01:08,835] [csds-cluster-ga:de5e46e0-511e-11e9-8f7c-ad64bf35605e] i.c.s.RepairRunner - Next segment to run : de609156-511e-11e9-8f7c-ad64bf35605e
INFO [2019-03-28 02:01:22,015] [pool-8-thread-10] i.c.j.JmxConnectionFactory - Adding new JMX Proxy for host y.y.y.y
INFO [2019-03-28 02:01:26,630] [csds-cluster-ga:de5e46e0-511e-11e9-8f7c-ad64bf35605e:de609156-511e-11e9-8f7c-ad64bf35605e] i.c.s.SegmentRunner - It is ok to repair segment 'de609156-511e-11e9-8f7c-ad64bf35605e' on repair run with id 'de5e46e0-511e-11e9-8f7c-ad64bf35605e'
INFO [2019-03-28 02:01:27,351] [csds-cluster-ga:de5e46e0-511e-11e9-8f7c-ad64bf35605e:de609156-511e-11e9-8f7c-ad64bf35605e] i.c.j.JmxProxy - Triggering repair of range (-3033719766291708928,-3032111669302645812] for keyspace "keyspace1" on host x.x.x.x, with repair parallelism parallel, in cluster with Cassandra version '3.0.12' (can use DATACENTER_AWARE 'true'), for column families: [table1]
INFO [2019-03-28 02:01:27,360] [csds-cluster-ga:de5e46e0-511e-11e9-8f7c-ad64bf35605e:de609156-511e-11e9-8f7c-ad64bf35605e] i.c.s.SegmentRunner - Repair for segment de609156-511e-11e9-8f7c-ad64bf35605e started, status wait will timeout in 1800000 millis
ERROR [2019-03-28 02:01:27,481] [SchedulingManagerTimer] i.c.s.SchedulingManager - failed managing schedule for run with id: 337910f0-5388-11e8-aab6-9d29436e9fff
ERROR [2019-03-28 02:01:27,498] [SchedulingManagerTimer] i.c.s.SchedulingManager - catch exception
java.lang.IllegalArgumentException: A metric named io.cassandrareaper.service.RepairRunner.repairProgress.csdsclusterga.csds.de5e46e0511e11e98f7cad64bf35605e already exists
    at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)
    at io.cassandrareaper.service.RepairRunner.
Hope this helps.
Thanks!
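The trace above suggests the repair progress gauge was registered a second time when the run was re-triggered, and that the registry rejects duplicate names. As a rough illustration only, here is a minimal sketch of replacing an existing gauge instead of failing. `TinyRegistry` and `registerOrReplace` are hypothetical names and stand in for the real registry; this is not Reaper's code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class IdempotentGauges {

  /** Hypothetical stand-in for a metrics registry that rejects duplicate names. */
  public static final class TinyRegistry {
    private final Map<String, Supplier<Double>> gauges = new ConcurrentHashMap<>();

    public void register(String name, Supplier<Double> gauge) {
      // Mirrors the behaviour seen in the log: a second registration fails.
      if (gauges.putIfAbsent(name, gauge) != null) {
        throw new IllegalArgumentException("A metric named " + name + " already exists");
      }
    }

    public boolean remove(String name) {
      return gauges.remove(name) != null;
    }

    public double read(String name) {
      return gauges.get(name).get();
    }
  }

  /** Drops any gauge already registered under the name, then registers the new one. */
  public static void registerOrReplace(TinyRegistry registry, String name, Supplier<Double> gauge) {
    registry.remove(name);
    registry.register(name, gauge);
  }

  public static void main(String[] args) {
    TinyRegistry registry = new TinyRegistry();
    String name = "repairProgress.example-run"; // illustrative metric name
    registerOrReplace(registry, name, () -> 0.25);
    // A restarted runner registers again; this no longer throws.
    registerOrReplace(registry, name, () -> 0.50);
    System.out.println(registry.read(name));
  }
}
```

The remove-then-register pair is not atomic, so a real fix would need to consider concurrent runners for the same run id.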
On Fri, Mar 29, 2019 at 9:47 AM Alexander Dejanovski wrote:
Hi @shalomsagges,
are you still affected by this issue?
I came upon a scenario where a badly configured keyspace (in this case, a typo in the DC name) caused the scheduler to skip all other schedules, even those on completely different clusters. In the scheduler window, the Next Run column showed a next run time that was already x hours/days in the past.
In the reaper logs, I see the following error:
ERROR [2017-09-13 17:35:26,898] [SchedulingManagerTimer] c.s.r.s.SchedulingManager - failed managing schedule for run with id: d0175a80-95ff-11e7-bb1a-0b6d8c0d2134
ERROR [2017-09-13 17:35:26,898] [SchedulingManagerTimer] c.s.r.s.SchedulingManager - catch exception
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at com.spotify.reaper.resources.CommonTools.getClusterNodes(CommonTools.java:211)
    at com.spotify.reaper.resources.CommonTools.registerRepairRun(CommonTools.java:71)
    at com.spotify.reaper.service.SchedulingManager.startNewRunForUnit(SchedulingManager.java:173)
    at com.spotify.reaper.service.SchedulingManager.manageSchedule(SchedulingManager.java:139)
    at com.spotify.reaper.service.SchedulingManager.run(SchedulingManager.java:83)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
After I removed the problematic keyspace and recreated the scheduler, the issue was resolved.
So in order to mitigate the issue, would it be possible for the scheduler to mark and skip problematic keyspaces without affecting the other schedules?
Thanks!
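To make the request concrete, here is a minimal sketch of the behaviour I have in mind: each schedule is handled inside its own try/catch, so one failing schedule is recorded and skipped while the rest still run. The class name `IsolatedScheduling`, the method `manageAll`, and the schedule ids are all made up for illustration; this is not how Reaper's `SchedulingManager` is actually written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class IsolatedScheduling {

  /**
   * Runs the handler once per schedule id. A failure in one schedule is
   * recorded and skipped so the remaining schedules are still handled.
   * Returns the failures keyed by schedule id, in iteration order.
   */
  public static Map<String, Exception> manageAll(List<String> scheduleIds, Consumer<String> handler) {
    Map<String, Exception> failures = new LinkedHashMap<>();
    for (String id : scheduleIds) {
      try {
        handler.accept(id);
      } catch (RuntimeException e) {
        // Mark this schedule as problematic and move on to the next one.
        failures.put(id, e);
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    List<String> handled = new ArrayList<>();
    Map<String, Exception> failures = manageAll(
        Arrays.asList("schedule-a", "schedule-bad", "schedule-c"),
        id -> {
          if (id.equals("schedule-bad")) {
            // Simulates reading the first node from an empty node list.
            new ArrayList<String>().get(0);
          }
          handled.add(id);
        });
    System.out.println("handled: " + handled);
    System.out.println("skipped: " + failures.keySet());
  }
}
```

In this sketch, "schedule-a" and "schedule-c" are still handled and only "schedule-bad" ends up in the failure map, which could then be surfaced in the UI or logs.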
┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: REAP-184