thelastpickle / cassandra-reaper

Automated Repair Awesomeness for Apache Cassandra
http://cassandra-reaper.io/
Apache License 2.0

Repair failed with error 'Terminate session is called' #1264

Open gaborauth opened 1 year ago

gaborauth commented 1 year ago


I experienced a weird issue with one of my clusters, and only with one keyspace. All repair segments fail and the fail count keeps incrementing. Sometimes the affected nodes fail as well and are restarted by Kubernetes.


On the Cassandra side, I see exceptions like this:

[cassandra-dc03-3] ERROR [RMI TCP Connection(1108)-116.203.207.207] 2023-01-13 16:26:00,743 RepairRunnable.java:178 - Repair f7187ee0-935e-11ed-9017-a9a7ecf41662 failed:
[cassandra-dc03-3] java.lang.RuntimeException: Repair session f71bb330-935e-11ed-9017-a9a7ecf41662 for range [(440469675443902358,674085611305865890]] failed with error Terminate session is called
[cassandra-dc03-3]      at org.apache.cassandra.repair.RepairRunnable$RepairSessionCallback.onFailure(RepairRunnable.java:698)
[cassandra-dc03-3]      at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1056)
[cassandra-dc03-3]      at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
[cassandra-dc03-3]      at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1138)
[cassandra-dc03-3]      at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:958)
[cassandra-dc03-3]      at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:748)
[cassandra-dc03-3]      at org.apache.cassandra.repair.RepairSession.forceShutdown(RepairSession.java:342)
[cassandra-dc03-3]      at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:412)
[cassandra-dc03-3]      at org.apache.cassandra.service.StorageService.forceTerminateAllRepairSessions(StorageService.java:4200)
[cassandra-dc03-3]      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[cassandra-dc03-3]      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
[cassandra-dc03-3]      at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
[cassandra-dc03-3]      at java.base/java.lang.reflect.Method.invoke(Unknown Source)
[cassandra-dc03-3]      at sun.reflect.misc.Trampoline.invoke(Unknown Source)
[cassandra-dc03-3]      at jdk.internal.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
[cassandra-dc03-3]      at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
[cassandra-dc03-3]      at java.base/java.lang.reflect.Method.invoke(Unknown Source)
[cassandra-dc03-3]      at java.base/sun.reflect.misc.MethodUtil.invoke(Unknown Source)
[cassandra-dc03-3]      at java.management/com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(Unknown Source)
[cassandra-dc03-3]      at java.management/com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(Unknown Source)
[cassandra-dc03-3]      at java.management/com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(Unknown Source)
[cassandra-dc03-3]      at java.management/com.sun.jmx.mbeanserver.PerInterface.invoke(Unknown Source)
[cassandra-dc03-3]      at java.management/com.sun.jmx.mbeanserver.MBeanSupport.invoke(Unknown Source)
[cassandra-dc03-3]      at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(Unknown Source)
[cassandra-dc03-3]      at java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(Unknown Source)
[cassandra-dc03-3]      at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.doOperation(Unknown Source)
[cassandra-dc03-3]      at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(Unknown Source)
[cassandra-dc03-3]      at java.base/java.security.AccessController.doPrivileged(Native Method)
[cassandra-dc03-3]      at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(Unknown Source)
[cassandra-dc03-3]      at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.invoke(Unknown Source)
[cassandra-dc03-3]      at java.base/jdk.internal.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
[cassandra-dc03-3]      at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
[cassandra-dc03-3]      at java.base/java.lang.reflect.Method.invoke(Unknown Source)
[cassandra-dc03-3]      at java.rmi/sun.rmi.server.UnicastServerRef.dispatch(Unknown Source)
[cassandra-dc03-3]      at java.rmi/sun.rmi.transport.Transport$1.run(Unknown Source)
[cassandra-dc03-3]      at java.rmi/sun.rmi.transport.Transport$1.run(Unknown Source)
[cassandra-dc03-3]      at java.base/java.security.AccessController.doPrivileged(Native Method)
[cassandra-dc03-3]      at java.rmi/sun.rmi.transport.Transport.serviceCall(Unknown Source)
[cassandra-dc03-3]      at java.rmi/sun.rmi.transport.tcp.TCPTransport.handleMessages(Unknown Source)
[cassandra-dc03-3]      at java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
[cassandra-dc03-3]      at java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(Unknown Source)
[cassandra-dc03-3]      at java.base/java.security.AccessController.doPrivileged(Native Method)
[cassandra-dc03-3]      at java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
[cassandra-dc03-3]      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
[cassandra-dc03-3]      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
[cassandra-dc03-3]      at java.base/java.lang.Thread.run(Unknown Source)
[cassandra-dc03-3] Caused by: java.io.IOException: Terminate session is called
[cassandra-dc03-3]      at org.apache.cassandra.service.ActiveRepairService.terminateSessions(ActiveRepairService.java:409)
[cassandra-dc03-3]      ... 38 common frames omitted

The CLI repair works flawlessly on the same keyspace (it's super fast because the keyspace lives in one DC and has only a few ranges and little data, so I run the CLI repair from crontab):

[2023-01-13 16:31:34,781] Starting repair command #23 (bea11f80-935f-11ed-aecc-e500511a7b1c), repairing keyspace p20220405 with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 48, pull repair: false, force repair: false, optimise streams: false, ignore unreplicated keyspaces: false)
[2023-01-13 16:31:41,028] Repair session bea7af30-935f-11ed-aecc-e500511a7b1c for range [(8375886178349756625,8748449053372153397], (-192514148029231763,75621920349127064], (7023965147399560621,7304385562009723812], (-782996530813933592,-584335104639577196], (75621920349127064,440469675443902358], (3484874585837077530,3928443573923766557], (-5553629018076287121,-5344290176620942342], (2699042956940947999,2832401673688828268], (2832401673688828268,3232614989114008566], (6663946002768270856,7023965147399560621], (-584335104639577196,-192514148029231763], (3232614989114008566,3484874585837077530], (-6943915714385929091,-6728252751992420541], (-6728252751992420541,-6423662984068543999], (8149126386130840257,8375886178349756625], (3928443573923766557,4290171673804057124], (6511701223995286701,6663946002768270856]] finished (progress: 7%)
[2023-01-13 16:31:42,161] Repair session c0cdea90-935f-11ed-aecc-e500511a7b1c for range [(4290171673804057124,4542251102766035747], (-4100770170940350368,-3544889136023426432], (8748449053372153397,8986707583684987286], (-4347391553646483944,-4100770170940350368], (-7937683373524817555,-7640032189044617929], (440469675443902358,674085611305865890], (-4896718836723061954,-4594012936352617520], (-4594012936352617520,-4347391553646483944], (-6013957287594759548,-5762967859531631900], (-1824995112033495814,-1543401375347916232], (-8235334558005017180,-7937683373524817555], (-8560168480947503225,-8235334558005017180], (-3544889136023426432,-3339147531942620651], (7686096653701801857,7922366593911923889], (-3339147531942620651,-3133405927861814870]] finished (progress: 9%)
[2023-01-13 16:31:43,502] Repair session bf5ad290-935f-11ed-aecc-e500511a7b1c for range [(-7640032189044617929,-7159578676779437640], (5019740684615978816,5276175336550084046], (8986707583684987286,-9053006490617367975], (-5762967859531631900,-5553629018076287121], (4542251102766035747,5019740684615978816], (5276175336550084046,5532609988484189277], (-2009166707846883473,-1824995112033495814], (7922366593911923889,8149126386130840257], (-8806587485782435600,-8560168480947503225], (-7159578676779437640,-6943915714385929091], (-9053006490617367975,-8806587485782435600], (674085611305865890,1053767372865241140], (2134672759262399647,2565684240193067730], (2565684240193067730,2699042956940947999], (5532609988484189277,5864117332107140574], (-2193338303660271132,-2009166707846883473]] finished (progress: 11%)
[2023-01-13 16:31:43,512] Repair completed successfully
[2023-01-13 16:31:43,512] Repair command #23 finished in 8 seconds
[2023-01-13 16:31:43,521] condition satisfied queried for parent session status and discovered repair completed.
[2023-01-13 16:31:43,529] Repair completed successfully

So I don't know which side has the issue, but it doesn't look like a Cassandra issue.

Also, for info, I checked the repairs: the failing repairs have a negative segment timeout, while the other repairs have positive segment timeouts:

Segment timeout (mins) | -268435456

Some other information:

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: REAP-74

adejanovski commented 1 year ago

Hi @gaborauth,

are these repair runs created by schedules? Is the adaptive feature enabled on these schedules? Could you check the segment timeout on the schedule?

The number of segments or the segment timeout can be adjusted if the adaptive feature is enabled, but aside from a sad sad bug that allows the timeout to go into negative values, I'm a little puzzled by this behavior.

Sometimes the affected nodes fail as well and are restarted by Kubernetes.

Ouch, that's very weird as well 🤔 Anything else in the logs of these nodes when they are restarted?

gaborauth commented 1 year ago

are these repair runs created by schedules? Is the adaptive feature enabled on these schedules?

Yes and yes.

Could you check the segment timeout on the schedule?

The parameters of the schedule:

Next run | 20 January 2023 08:39
Owner | auto-scheduling
Incremental | false
Segment count per node | 19
Intensity | 0.8999999761581421
Repair threads | 1
Segment timeout (mins) | -268435456
Repair parallelism | DATACENTER_AWARE
Creation time | 6 November 2022 21:36
Adaptive | true

Anything else in the logs of these nodes when they are restarted?

I have no logs from that period (I paused it in late December). BTW: the repair process creates snapshots per repair session and then leaves them hanging, so maybe that pile of snapshots caused a resource shortage and led to the restarts; I had to delete them manually.

adejanovski commented 1 year ago

OK, this doesn't seem to be due to the adaptive nature of the schedule. The only things an adaptive schedule can adjust are the number of segments and the segment timeout:

We're never lowering the timeout.

I've also checked the auto-scheduler code, and it just uses the default timeout.

Could you check the timeout configured for the other schedules? Could you modify the timeout for this schedule and monitor if it gets lowered by Reaper itself?

gaborauth commented 1 year ago

Could you check the timeout configured for the other schedules?

I've just checked some:

Segment timeout (mins) | 491520
Segment timeout (mins) | 62914560
Segment timeout (mins) | -268435456
Segment timeout (mins) | 960
Segment timeout (mins) | 240
Segment timeout (mins) | 15360
Segment timeout (mins) | 491520
Segment timeout (mins) | 7680
Segment timeout (mins) | 62914560
Segment timeout (mins) | 61440

Could you modify the timeout for this schedule and monitor if it gets lowered by Reaper itself?

Where can I modify it? It doesn't appear in the edit dialog. Is it safe to modify it directly in the cassandra_reaper.repair_unit_v1 table (the timeout field)?

adejanovski commented 1 year ago

You should have access to the timeout under the Advanced settings section of the edit dialog. Click on "Advanced settings" to make that appear.

adejanovski commented 1 year ago

actually you're right, I'm not seeing it either 😓 We need to make it accessible. Changing the value in the table directly is safe.

gaborauth commented 1 year ago

Changing the value in the table directly is safe.

Okay, I've just updated the timeout of all repair units to '60' (I had to restart Reaper), and... it does the magic, it looks like it's working. :)

I will check the timeout values regularly.

We're never lowering the timeout.

Is the increasing of the timeout overflow-safe?

adejanovski commented 1 year ago

Is the increasing of the timeout overflow-safe?

Looking at the numbers you have, I think we should put a max threshold in place, above which Reaper cannot increase it any further.
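
A minimal sketch of that idea, assuming a hypothetical raiseTimeout helper, a doubling growth step (matching the progression seen in this issue), and an arbitrary one-week ceiling; this is illustration only, not Reaper's actual code:

```java
// Hypothetical sketch, not actual Reaper code: grow the segment timeout
// for an adaptive schedule, but clamp it so the 32-bit int can never
// wrap around into negative values.
final class SegmentTimeoutPolicy {
    // Assumed ceiling (one week in minutes); the real threshold is up for discussion.
    static final int MAX_SEGMENT_TIMEOUT_MINS = 7 * 24 * 60;

    static int raiseTimeout(int currentTimeoutMins) {
        // Widen to long before doubling so the intermediate value cannot overflow.
        long doubled = (long) currentTimeoutMins * 2;
        return (int) Math.min(doubled, MAX_SEGMENT_TIMEOUT_MINS);
    }
}
```

With a cap like this, the schedule above would level off at 10080 minutes instead of climbing all the way to the overflow.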

gaborauth commented 1 year ago

I tried the doubling, and -268435456 is definitely an overflow of the 32-bit integer type, so a reasonable threshold will solve it:

60
120
240
480
960
1920
3840
7680
15360
30720
61440
122880
245760
491520
983040
1966080
3932160
7864320
15728640
31457280
62914560
125829120
251658240
503316480
1006632960
2013265920
-268435456
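
For reference, a minimal Java loop (illustration only, not Reaper code) reproduces that sequence: starting at 60 and doubling a plain int lands exactly on -268435456 after 2013265920, because 4026531840 doesn't fit in a signed 32-bit integer.

```java
public class TimeoutOverflowDemo {
    public static void main(String[] args) {
        int timeoutMins = 60;            // starting segment timeout in minutes
        while (timeoutMins > 0) {
            System.out.println(timeoutMins);
            timeoutMins *= 2;            // 2013265920 * 2 wraps the 32-bit int
        }
        System.out.println(timeoutMins); // prints -268435456, matching the value above
    }
}
```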