thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Badly Configured Keyspaces Corrupt The Scheduler #202

Open shalomsagges opened 7 years ago

shalomsagges commented 7 years ago


I came across a scenario where a badly configured keyspace (in this case, a typo in the DC name) caused the scheduler to skip all other schedules, even those on completely different clusters. In the scheduler window, the Next Run column showed a time in the past (x hours/days ago).


In the Reaper logs, I see the following error:

ERROR [2017-09-13 17:35:26,898] [SchedulingManagerTimer] c.s.r.s.SchedulingManager - failed managing schedule for run with id: d0175a80-95ff-11e7-bb1a-0b6d8c0d2134
ERROR [2017-09-13 17:35:26,898] [SchedulingManagerTimer] c.s.r.s.SchedulingManager - catch exception
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at com.spotify.reaper.resources.CommonTools.getClusterNodes(CommonTools.java:211)
    at com.spotify.reaper.resources.CommonTools.registerRepairRun(CommonTools.java:71)
    at com.spotify.reaper.service.SchedulingManager.startNewRunForUnit(SchedulingManager.java:173)
    at com.spotify.reaper.service.SchedulingManager.manageSchedule(SchedulingManager.java:139)
    at com.spotify.reaper.service.SchedulingManager.run(SchedulingManager.java:83)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)

After I removed the problematic keyspace and recreated the scheduler, the issue was resolved.

So in order to mitigate the issue, would it be possible for the scheduler to mark and skip problematic keyspaces without affecting the other schedules?
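A minimal sketch of what such isolation could look like, assuming each schedule can be processed independently. All names here (ScheduleIsolationSketch, manageAll, startRun) are hypothetical and are not taken from Reaper's code base; the sketch only illustrates the "mark and skip" idea, not how SchedulingManager actually works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names are Reaper's actual classes or methods.
final class ScheduleIsolationSketch {

  /**
   * Tries to start a run for every schedule. A failure in one schedule is
   * recorded and skipped so that the remaining schedules are still processed.
   * Returns a map of schedule id to the error that was recorded for it.
   */
  static Map<String, String> manageAll(List<String> scheduleIds, Consumer<String> startRun) {
    Map<String, String> failures = new LinkedHashMap<>();
    for (String id : scheduleIds) {
      try {
        startRun.accept(id);
      } catch (RuntimeException e) {
        // Mark the schedule as problematic and continue with the next one.
        failures.put(id, e.toString());
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    List<String> started = new ArrayList<>();
    Consumer<String> startRun = id -> {
      if (id.equals("bad-keyspace")) {
        // Simulates the empty node list behind the IndexOutOfBoundsException.
        new ArrayList<String>().get(0);
      }
      started.add(id);
    };
    Map<String, String> failures =
        manageAll(Arrays.asList("a", "bad-keyspace", "b"), startRun);
    System.out.println("started=" + started + " failed=" + failures.keySet());
  }
}
```

In this sketch the failing schedule ends up in the returned map, which a UI could use to flag it, while the schedules before and after it still start.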

Thanks!


adejanovski commented 5 years ago

Hi @shalomsagges,

are you still affected by this issue?

shalomsagges commented 5 years ago

Hi Alex,

Apologies for the late reply, just saw your message.

Yes, I do experience this issue from time to time. I got the same error a couple of weeks ago. Here's the error (I've added some info lines shown prior to that error):

INFO [2019-03-27 16:55:12,402] [apac-new-cluster:454bf7b0-5090-11e9-8f7c-ad64bf35605e] i.c.s.RepairRunner

Hope this helps.

Thanks!

On Fri, Mar 29, 2019 at 9:47 AM Alexander Dejanovski <notifications@github.com> wrote:

Hi @shalomsagges,

are you still affected by this issue?
