Badly Configured Keyspaces Corrupt The Scheduler

shalomsagges commented 7 years ago

I ran into a scenario where a badly configured keyspace (in this case, a typo in the DC name) caused the scheduler to skip all other schedules, even those on completely different clusters. In the scheduler window, the Next Run column showed next-run times that were already hours or days in the past.


In the reaper logs, I see the following error:

ERROR [2017-09-13 17:35:26,898] [SchedulingManagerTimer] c.s.r.s.SchedulingManager - failed managing schedule for run with id: d0175a80-95ff-11e7-bb1a-0b6d8c0d2134
ERROR [2017-09-13 17:35:26,898] [SchedulingManagerTimer] c.s.r.s.SchedulingManager - catch exception
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:653)
    at java.util.ArrayList.get(ArrayList.java:429)
    at com.spotify.reaper.resources.CommonTools.getClusterNodes(CommonTools.java:211)
    at com.spotify.reaper.resources.CommonTools.registerRepairRun(CommonTools.java:71)
    at com.spotify.reaper.service.SchedulingManager.startNewRunForUnit(SchedulingManager.java:173)
    at com.spotify.reaper.service.SchedulingManager.manageSchedule(SchedulingManager.java:139)
    at com.spotify.reaper.service.SchedulingManager.run(SchedulingManager.java:83)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)

After I removed the problematic keyspace and recreated the scheduler, the issue was resolved.

So in order to mitigate the issue, would it be possible for the scheduler to mark and skip problematic keyspaces without affecting the other schedules?
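A minimal sketch of that mark-and-skip idea, assuming each schedule can be treated as an independent task. This is not Reaper's actual `SchedulingManager` code; `IsolatedScheduler`, `Task`, `runAll`, and `quarantined` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from Reaper's code base.
final class IsolatedScheduler {

    /** Stand-in for "start a repair run for this schedule". */
    interface Task {
        void run() throws Exception;
    }

    // Schedule id -> description of the error that caused it to be skipped.
    private final Map<String, String> quarantined = new LinkedHashMap<>();

    /** Runs every schedule; a failure in one is recorded and does not stop the loop. */
    List<String> runAll(Map<String, Task> schedulesById) {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, Task> entry : schedulesById.entrySet()) {
            String id = entry.getKey();
            if (quarantined.containsKey(id)) {
                continue; // already marked as problematic, skip it
            }
            try {
                entry.getValue().run();
                completed.add(id);
            } catch (Exception e) {
                // Mark the schedule and carry on with the remaining ones.
                quarantined.put(id, e.toString());
            }
        }
        return completed;
    }

    /** Read-only view of the schedules that were marked as problematic. */
    Map<String, String> quarantined() {
        return Collections.unmodifiableMap(quarantined);
    }
}
```

With three schedules where the middle one throws (for example the `IndexOutOfBoundsException` from the trace above), the other two would still run and the failing one would show up in `quarantined()`. A real implementation would presumably also need a way to clear the mark once the keyspace configuration is fixed.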

Thanks!


adejanovski commented 5 years ago

Hi @shalomsagges,

are you still affected by this issue?

shalomsagges commented 5 years ago

Hi Alex,

Apologies for the late reply, just saw your message.

Yes, I do experience this issue from time to time. I got the same error a couple of weeks ago. Here's the error (I've added some info lines shown prior to that error):

INFO [2019-03-27 16:55:12,402] [apac-new-cluster:454bf7b0-5090-11e9-8f7c-ad64bf35605e] i.c.s.RepairRunner - Repair amount done 3072.0
INFO [2019-03-27 16:55:12,402] [apac-new-cluster:454bf7b0-5090-11e9-8f7c-ad64bf35605e] i.c.s.RepairRunner - Repairs for repair run #454bf7b0-5090-11e9-8f7c-ad64bf35605e done
INFO [2019-03-28 02:00:19,863] [SchedulingManagerTimer] i.c.s.SchedulingManager - repair unit '3372cf60-5388-11e8-aab6-9d29436e9fff' should be repaired based on RepairSchedule with id '337910f0-5388-11e8-aab6-9d29436e9fff'
INFO [2019-03-28 02:00:23,523] [SchedulingManagerTimer] i.c.s.SegmentGenerator - Dividing token range [-9222807929782021608,-9220477744443608599) into 1 segments
INFO [2019-03-28 02:00:23,523] [SchedulingManagerTimer] i.c.s.SegmentGenerator - Dividing token range [-9220477744443608599,-9220351271168066047) into 1 segments
INFO [2019-03-28 02:00:23,523] [SchedulingManagerTimer] i.c.s.SegmentGenerator - Dividing token range [-9220351271168066047,-9214212501740770064) into 1 segments
INFO [2019-03-28 02:00:23,523] [SchedulingManagerTimer] i.c.s.SegmentGenerator - Dividing token range [-9214212501740770064,-9213108272269290458) into 1 segments
INFO [2019-03-28 02:00:23,523] [SchedulingManagerTimer] i.c.s.SegmentGenerator - Dividing token range [-9213108272269290458,-9213037253608362435) into 1 segments
INFO [2019-03-28 02:00:23,523] [SchedulingManagerTimer] i.c.s.SegmentGenerator - Dividing token range [-9213037253608362435,-9204297740402592459) into 1 segments
INFO [2019-03-28 02:00:23,523] [SchedulingManagerTimer] i.c.s.SegmentGenerator - Dividing token range [-9204297740402592459,-9201624793629377773) into 1 segments
INFO [2019-03-28 02:00:23,523] [SchedulingManagerTimer] i.c.s.SegmentGenerator - Dividing token range [-9201624793629377773,-9200038894730524241) into 1 segments
INFO [2019-03-28 02:00:23,523] [SchedulingManagerTimer] i.c.s.SegmentGenerator - Dividing token range [-9200038894730524241,-9199119758643915141) into 1 segments
INFO [2019-03-28 02:00:23,523] [SchedulingManagerTimer] i.c.s.SegmentGenerator - Dividing token range [-9199119758643915141,-9193671437918332775) into 1 segments
... ... ...
INFO [2019-03-28 02:00:23,553] [SchedulingManagerTimer] i.c.s.SegmentGenerator - Dividing token range [9222160023132929882,-9222807929782021608) into 1 segments
INFO [2019-03-28 02:01:03,382] [SchedulingManagerTimer] i.c.s.RepairManager - Starting a run with id #de5e46e0-511e-11e9-8f7c-ad64bf35605e with current state 'NOT_STARTED'
INFO [2019-03-28 02:01:03,383] [SchedulingManagerTimer] i.c.s.RepairManager - scheduling repair for repair run #de5e46e0-511e-11e9-8f7c-ad64bf35605e
INFO [2019-03-28 02:01:04,518] [pool-1-thread-1] i.c.s.RepairManager - Restarting run id de5e46e0-511e-11e9-8f7c-ad64bf35605e that has no runner
INFO [2019-03-28 02:01:04,518] [pool-1-thread-1] i.c.s.RepairManager - Starting a run with id #de5e46e0-511e-11e9-8f7c-ad64bf35605e with current state 'RUNNING'
INFO [2019-03-28 02:01:04,518] [pool-1-thread-1] i.c.s.RepairManager - re-trigger a running run after restart, with id de5e46e0-511e-11e9-8f7c-ad64bf35605e
INFO [2019-03-28 02:01:04,518] [pool-1-thread-1] i.c.s.RepairManager - scheduling repair for repair run #de5e46e0-511e-11e9-8f7c-ad64bf35605e
INFO [2019-03-28 02:01:08,748] [csds-cluster-ga:de5e46e0-511e-11e9-8f7c-ad64bf35605e] i.c.s.RepairRunner - Running segment for range (-9222807929782021608,-9222807929782021608]
INFO [2019-03-28 02:01:08,835] [csds-cluster-ga:de5e46e0-511e-11e9-8f7c-ad64bf35605e] i.c.s.RepairRunner - Next segment to run : de609156-511e-11e9-8f7c-ad64bf35605e
INFO [2019-03-28 02:01:22,015] [pool-8-thread-10] i.c.j.JmxConnectionFactory - Adding new JMX Proxy for host y.y.y.y
INFO [2019-03-28 02:01:26,630] [csds-cluster-ga:de5e46e0-511e-11e9-8f7c-ad64bf35605e:de609156-511e-11e9-8f7c-ad64bf35605e] i.c.s.SegmentRunner - It is ok to repair segment 'de609156-511e-11e9-8f7c-ad64bf35605e' on repair run with id 'de5e46e0-511e-11e9-8f7c-ad64bf35605e'
INFO [2019-03-28 02:01:27,351] [csds-cluster-ga:de5e46e0-511e-11e9-8f7c-ad64bf35605e:de609156-511e-11e9-8f7c-ad64bf35605e] i.c.j.JmxProxy - Triggering repair of range (-3033719766291708928,-3032111669302645812] for keyspace "keyspace1" on host x.x.x.x, with repair parallelism parallel, in cluster with Cassandra version '3.0.12' (can use DATACENTER_AWARE 'true'), for column families: [table1]
INFO [2019-03-28 02:01:27,360] [csds-cluster-ga:de5e46e0-511e-11e9-8f7c-ad64bf35605e:de609156-511e-11e9-8f7c-ad64bf35605e] i.c.s.SegmentRunner - Repair for segment de609156-511e-11e9-8f7c-ad64bf35605e started, status wait will timeout in 1800000 millis
ERROR [2019-03-28 02:01:27,481] [SchedulingManagerTimer] i.c.s.SchedulingManager - failed managing schedule for run with id: 337910f0-5388-11e8-aab6-9d29436e9fff
ERROR [2019-03-28 02:01:27,498] [SchedulingManagerTimer] i.c.s.SchedulingManager - catch exception
java.lang.IllegalArgumentException: A metric named io.cassandrareaper.service.RepairRunner.repairProgress.csdsclusterga.csds.de5e46e0511e11e98f7cad64bf35605e already exists
    at com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:91)
    at io.cassandrareaper.service.RepairRunner.<init>(RepairRunner.java:112)
    at io.cassandrareaper.service.RepairManager.startRunner(RepairManager.java:287)
    at io.cassandrareaper.service.RepairManager.startRepairRun(RepairManager.java:241)
    at io.cassandrareaper.service.SchedulingManager.manageSchedule(SchedulingManager.java:168)
    at io.cassandrareaper.service.SchedulingManager.run(SchedulingManager.java:97)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
ERROR [2019-03-28 02:01:27,498] [SchedulingManagerTimer] i.c.s.SchedulingManager - SchedulingManager failed. Exiting JVM.
ERROR [2019-03-28 02:01:28,338] [pool-1-thread-1] i.c.ReaperApplication - Couldn't resume running repair runs
io.cassandrareaper.ReaperException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
    at io.cassandrareaper.service.RepairManager.resumeRunningRepairRuns(RepairManager.java:127)
    at io.cassandrareaper.ReaperApplication.lambda$run$0(ReaperApplication.java:213)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
    at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:84)
    at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:37)
    at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:68)
    at io.cassandrareaper.storage.CassandraStorage.getClusters(CassandraStorage.java:294)
    at io.cassandrareaper.storage.CassandraStorage.getRepairRunsWithState(CassandraStorage.java:500)
    at io.cassandrareaper.service.RepairManager.resumeRunningRepairRuns(RepairManager.java:96)
    ... 8 common frames omitted

Hope this helps.

Thanks!


thelastpickle / cassandra-reaper

Badly Configured Keyspaces Corrupt The Scheduler #202
