spotify / cassandra-reaper

Software to run automated repairs of cassandra

Entire repair run moves to ERROR state on exceptions in JMX methods #89

Closed ahenry closed 9 years ago

ahenry commented 9 years ago

I've had several repair runs fail due to transient errors (usually involving a node going up or down, but not always). The exceptions are thrown by JMX remote calls. For example:

ERROR [2015-03-30 22:37:34,539] com.spotify.reaper.service.RepairRunner: RepairRun FAILURE
ERROR [2015-03-30 22:37:34,540] com.spotify.reaper.service.RepairRunner: java.lang.reflect.UndeclaredThrowableException
ERROR [2015-03-30 22:37:34,540] com.spotify.reaper.service.RepairRunner: [com.sun.proxy.$Proxy59.forceTerminateAllRepairSessions(Unknown Source), com.spotify.reaper.cassandra.JmxProxy.cancelAllRepairs(JmxProxy.java:265), com.spotify.reaper.service.SegmentRunner.abort(SegmentRunner.java:89), com.spotify.reaper.service.SegmentRunner.abort(SegmentRunner.java:212), com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:150), com.spotify.reaper.service.SegmentRunner.triggerRepair(SegmentRunner.java:70), com.spotify.reaper.service.RepairRunner.repairSegment(RepairRunner.java:202), com.spotify.reaper.service.RepairRunner.startNextSegment(RepairRunner.java:156), com.spotify.reaper.service.RepairRunner.run(RepairRunner.java:89), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471), java.util.concurrent.FutureTask.run(FutureTask.java:262), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615), java.lang.Thread.run(Thread.java:745)]

and

ERROR [2015-04-08 21:19:01,950] com.spotify.reaper.service.RepairRunner: RepairRun FAILURE
ERROR [2015-04-08 21:19:01,950] com.spotify.reaper.service.RepairRunner: java.lang.reflect.UndeclaredThrowableException
ERROR [2015-04-08 21:19:01,950] com.spotify.reaper.service.RepairRunner: [com.sun.proxy.$Proxy60.getPendingTasks(Unknown Source), com.spotify.reaper.cassandra.JmxProxy.getPendingCompactions(JmxProxy.java:232), com.spotify.reaper.service.SegmentRunner.canRepair(SegmentRunner.java:177), com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:105), com.spotify.reaper.service.SegmentRunner.triggerRepair(SegmentRunner.java:70), com.spotify.reaper.service.RepairRunner.repairSegment(RepairRunner.java:202), com.spotify.reaper.service.RepairRunner.startNextSegment(RepairRunner.java:156), com.spotify.reaper.service.RepairRunner.run(RepairRunner.java:89), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471), java.util.concurrent.FutureTask.run(FutureTask.java:262), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615), java.lang.Thread.run(Thread.java:745)]

I've also seen an exception (don't have the logs, sorry) thrown by tokenRangeToEndpoint.

Following the style of ed21152e20e3c8a2bd7923b8b7ebaeb9c73755cd, I wrote a patch that catches RuntimeException in a couple of places in SegmentRunner.canRepair().
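
To illustrate the general idea (this is a minimal sketch, not the actual patch): a failed JMX-backed check can be treated as "cannot repair this segment right now" instead of letting the exception propagate and fail the whole run. The class name `SegmentCheckSketch`, the `IntSupplier` parameter, and the threshold value are hypothetical and chosen only to keep the example self-contained; they are not taken from Reaper's code.

```java
import java.lang.reflect.UndeclaredThrowableException;
import java.util.function.IntSupplier;

/** Hypothetical sketch; names are illustrative and not Reaper's actual code. */
final class SegmentCheckSketch {

    /** Assumed threshold, for illustration only. */
    static final int MAX_PENDING_COMPACTIONS = 20;

    /**
     * Returns true only if the (JMX-backed) query succeeds and reports a small
     * enough backlog. Any RuntimeException, such as the
     * UndeclaredThrowableException seen in the logs, is treated as
     * "cannot repair right now" so the caller can postpone the segment.
     */
    static boolean canRepair(IntSupplier pendingCompactions) {
        try {
            return pendingCompactions.getAsInt() <= MAX_PENDING_COMPACTIONS;
        } catch (RuntimeException e) {
            // Log and report "not now" rather than failing the whole repair run.
            System.err.println("JMX call failed, postponing segment: " + e);
            return false;
        }
    }

    /** Simulates a remote call that fails the way a JMX proxy can. */
    static int failingRemoteCall() {
        throw new UndeclaredThrowableException(new java.io.IOException("node down"));
    }
}
```

With this shape, `SegmentCheckSketch.canRepair(SegmentCheckSketch::failingRemoteCall)` returns false instead of throwing, so a transient node outage only delays one segment.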

Are you interested in a pull request, a patch file, or something else?

Yarin78 commented 9 years ago

Hi,

A PR is fine!

I think I fixed this in one place recently (see https://github.com/spotify/cassandra-reaper/commit/ed21152e20e3c8a2bd7923b8b7ebaeb9c73755cd).

rzvoncek commented 9 years ago

After merging #91, can we close this issue?

ahenry commented 9 years ago

Yeah, that will take care of all of the issues I've observed so far. Thanks!