OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/4317592650001557950 (gce: us-east1)
Test: longevity-10gb-3h-gce-test
Test id: e329d26e-9900-4665-93c8-f9d0f78a3657
Test name: enterprise-2022.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):
This scenario's cluster is a single data center with 6 nodes.
At 2022-05-17 15:44:42.553 the Nemesis AddRemoveDc started. It creates a new node in a new data center, changes the replication strategy to NetworkTopologyStrategy, and rebuilds the new node using the "nodetool rebuild" command. It then runs a full cluster repair on each node. Afterwards, it is supposed to decommission the node.
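For reference, the nemesis steps roughly correspond to the sequence below. This is a dry-run sketch only: the keyspace name, DC names, and replication factors are placeholders, and the real nemesis drives this through SCT rather than a shell script.

```shell
#!/bin/sh
# Dry-run sketch of the AddRemoveDc nemesis steps.
# `run` only echoes each command, so the sequence can be inspected
# without a live cluster.
run() { echo "+ $*"; }

KS=keyspace1   # placeholder keyspace name
NEW_DC=dc2     # placeholder name of the newly added data center

# 1. Switch the keyspace to NetworkTopologyStrategy so it can
#    replicate to the new DC (placeholder replication factors).
run cqlsh -e "ALTER KEYSPACE $KS WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, '$NEW_DC': 1}"

# 2. Populate the new node with data streamed from the existing DC.
run nodetool rebuild -- dc1

# 3. Full cluster repair, primary ranges only (run on each node).
run nodetool repair -pr

# 4. Finally, remove the extra node again.
run nodetool decommission
```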
The repairs of the first 6 nodes from the original DC went fine; however, the repair of the 7th node failed:
Command: '/usr/bin/nodetool repair -pr '
Exit code: 2
Stdout:
[2022-05-17 15:51:52,511] Starting repair command #1, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)
[2022-05-17 15:52:14,612] Repair session 1 failed
[2022-05-17 15:52:14,613] Repair session 1 finished
Stderr:
error: Repair job has failed with the error message: [2022-05-17 15:52:14,612] Repair session 1 failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2022-05-17 15:52:14,612] Repair session 1 failed
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:124)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
From the log of node 7:
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! WARNING | [shard 12] repair - repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 12 failed - 80 out of 256 ranges failed
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! INFO | [shard 0] repair - repair[bc5599f3-58ea-4629-b1c8-bf9bc6a05770]: Started to shutdown off-strategy compaction updater
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! INFO | [shard 0] repair - repair[bc5599f3-58ea-4629-b1c8-bf9bc6a05770]: Finished to shutdown off-strategy compaction updater
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! WARNING | [shard 0] repair - repair_tracker run for repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] failed: std::runtime_error ({shard 0: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 0 failed to repair 80 out of 256 ranges), shard 1: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 1 failed to repair 80 out of 256 ranges), shard 2: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 2 failed to repair 80 out of 256 ranges), shard 3: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 3 failed to repair 80 out of 256 ranges), shard 4: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 4 failed to repair 80 out of 256 ranges), shard 5: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 5 failed to repair 80 out of 256 ranges), shard 6: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 6 failed to repair 80 out of 256 ranges), shard 7: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 7 failed to repair 80 out of 256 ranges), shard 8: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 8 failed to repair 80 out of 256 ranges), shard 9: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 9 failed to repair 80 out of 256 ranges), shard 10: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 10 failed to repair 80 out of 256 ranges), shard 11: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 11 failed to repair 80 out of 256 ranges), shard 12: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 12 failed to repair 80 out of 256 ranges), shard 13: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 13 failed to repair 80 out of 256 ranges), shard 14: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 14 failed to repair 80 out of 256 ranges), shard 15: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 15 failed to repair 80 out of 256 ranges)})
I didn't find any concrete explanation of why the repair failed in node 7's logs.
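When the failing node's own log is inconclusive, one way to dig further is to grep the other nodes' logs for gossip state changes around the failure timestamp. A self-contained sketch, using two log lines taken verbatim from this run (the file path and patterns are illustrative):

```shell
# Sample of messages.log lines from this run (node 1).
cat > /tmp/messages.sample.log <<'EOF'
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-1 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-1 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
EOF

# Find nodes marked DOWN around the time the repair failed (15:52),
# and extract just the affected address.
grep DOWN /tmp/messages.sample.log | grep 15:52 \
  | grep -o 'InetAddress [0-9.]* is now DOWN'
# -> InetAddress 10.142.0.189 is now DOWN
```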
OS / Image: ami-0f0e4c1a732cd9815 (aws: eu-north-1)
Test: longevity-twcs-48h-test
Test id: 3c379dfb-8c1b-4a2f-895f-510272bef66c
Test name: scylla-master/longevity/longevity-twcs-48h-test
Test config file(s):
This happened while running scylla-bench stress, during the DestroyDataThenRepair nemesis, which corrupts some data, restarts the target node, and then triggers a repair operation on the target node.
From sct.log:
t:2022-05-10 22:55:16,885 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-05-10T22:55:16+00:00 longevity-twcs-48h-master-db-node-3c379dfb-5 ! INFO | [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: Finished to shutdown off-strategy compaction updater
< t:2022-05-10 22:55:16,886 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-05-10T22:55:16+00:00 longevity-twcs-48h-master-db-node-3c379dfb-5 ! WARNING | [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: repair_tracker run failed: std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is closed), shard 1: seastar::rpc::closed_error (connection is closed), shard 2: seastar::rpc::closed_error (connection is closed), shard 3: seastar::rpc::closed_error (connection is closed), shard 4: seastar::rpc::closed_error (connection is closed), shard 5: seastar::rpc::closed_error (connection is closed), shard 6: seastar::rpc::closed_error (connection is closed), shard 7: seastar::rpc::closed_error (connection is closed)})
< t:2022-05-10 22:55:16,886 f:events_processes.py l:146 c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20220510-135454-661869,id=0x7f2a355f08e0,default=True]
< t:2022-05-10 22:55:16,887 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:DEBUG > 2022-05-10 22:55:16.886 <2022-05-10 22:55:16.000>: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=e6583509-6746-4932-935c-6e829a8ccc71: type=WARNING regex=!\s*?WARNING line_number=152616 node=longevity-twcs-48h-master-db-node-3c379dfb-5
< t:2022-05-10 22:55:16,887 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:DEBUG > 2022-05-10T22:55:16+00:00 longevity-twcs-48h-master-db-node-3c379dfb-5 ! WARNING | [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: repair_tracker run failed: std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is closed), shard 1: seastar::rpc::closed_error (connection is closed), shard 2: seastar::rpc::closed_error (connection is closed), shard 3: seastar::rpc::closed_error (connection is closed), shard 4: seastar::rpc::closed_error (connection is closed), shard 5: seastar::rpc::closed_error (connection is closed), shard 6: seastar::rpc::closed_error (connection is closed), shard 7: seastar::rpc::closed_error (connection is closed)})
On the target node (node-5):
May 10 22:52:49 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 2] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: Repair 773 out of 773 ranges, shard=2, keyspace=scylla_bench, table={test, test_counters}, range=(9111342517421337944, 9134267295213569042], peers={10.0.0.5, 10.0.0.207}, live_peers={10.0.0.5, 10.0.0.207}
May 10 22:53:12 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 2] large_data - Writing large partition scylla_bench/test: 1108101562368 (66408933 bytes) to me-70778-big-Data.db
May 10 22:54:39 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 2] large_data - Writing large partition scylla_bench/test: 1112396529664 (350621549 bytes) to me-70738-big-Data.db
May 10 22:55:15 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 2] large_data - Writing large partition scylla_bench/test: 1541893259264 (350680199 bytes) to me-70786-big-Data.db
May 10 22:55:16 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: Started to shutdown off-strategy compaction updater
May 10 22:55:16 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: Finished to shutdown off-strategy compaction updater
May 10 22:55:16 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: repair_tracker run failed: std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is closed), shard 1: seastar::rpc::closed_error (connection is closed), shard 2: seastar::rpc::closed_error (connection is closed), shard 3: seastar::rpc::closed_error (connection is closed), shard 4: seastar::rpc::closed_error (connection is closed), shard 5: seastar::rpc::closed_error (connection is closed), shard 6: seastar::rpc::closed_error (connection is closed), shard 7: seastar::rpc::closed_error (connection is closed)})
Back in the first issue (e329d26e), the log of node 5 shows the node rebooting right after the repair failed:
2022-05-17T15:51:37+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | [shard 13] repair - Repair 43 out of 256 ranges, id=[id=3, uuid=c1f9f7c7-3135-45fb-8d2a-133b874bbc18], shard=13, keyspace=system_traces, table={node_slow_log_time_idx, sessions, sessions_time_idx, node_slow_log, events}, range=(-6272472925117774399, -6268556193531332347], peers={10.142.0.144, 10.142.0.44}, live_peers={10.142.0.144, 10.142.0.44}
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! NOTICE | Linux version 5.13.0-1024-gcp (buildd@lcy02-amd64-110) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #29~20.04.1-Ubuntu SMP Thu Apr 14 23:15:00 UTC 2022 (Ubuntu 5.13.0-1024.29~20.04.1-gcp 5.13.19)
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | Command line: BOOT_IMAGE=/boot/vmlinuz-5.13.0-1024-gcp root=PARTUUID=0ad9b47d-8aac-4376-896f-3c4a6893f6e5 ro console=ttyS0 net.ifnames=0 clocksource=tsc tsc=reliable panic=-1
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | KERNEL supported cpus:
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | Intel GenuineIntel
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | AMD AuthenticAMD
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | Hygon HygonGenuine
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | Centaur CentaurHauls
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | zhaoxin Shanghai
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point
The repair failed because one of the nodes (node 5, with IP 10.142.0.189) was down:
$ cat longevity-10gb-3h-2022-1-db-node-e329d26e-0-*/messages.log |grep DOWN|grep 15:52
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-1 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-1 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-2 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-2 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-3 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-3 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-4 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-4 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-4 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-4 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-6 ! INFO | [shard 4] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-6 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! INFO | [shard 4] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
@igorsimb commented on Mon May 23 2022
Installation details
Kernel Version: 5.13.0-1024-gcp
Scylla version (or git commit hash): 2022.1~rc5-20220515.6a1e89fbb with build-id 5cecadda59974548befb4305363bf374631fc3e1
Cluster size: 6 nodes (n1-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/4317592650001557950 (gce: us-east1)
Test: longevity-10gb-3h-gce-test
Test id: e329d26e-9900-4665-93c8-f9d0f78a3657
Test name: enterprise-2022.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):
$ hydra investigate show-monitor e329d26e-9900-4665-93c8-f9d0f78a3657
$ hydra investigate show-logs e329d26e-9900-4665-93c8-f9d0f78a3657
Logs:
Jenkins job URL
@KnifeyMoloko commented on Mon May 23 2022
Similar failure in:
Installation details
Kernel Version: 5.13.0-1022-aws
Scylla version (or git commit hash): 5.1.dev-20220504.b26a3da584cc with build-id ab2a33a30756c1513f4c516cd272291e75acec0e
Cluster size: 4 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0f0e4c1a732cd9815 (aws: eu-north-1)
Test: longevity-twcs-48h-test
Test id: 3c379dfb-8c1b-4a2f-895f-510272bef66c
Test name: scylla-master/longevity/longevity-twcs-48h-test
Test config file(s):
$ hydra investigate show-monitor 3c379dfb-8c1b-4a2f-895f-510272bef66c
$ hydra investigate show-logs 3c379dfb-8c1b-4a2f-895f-510272bef66c
Logs:
Jenkins job URL
@asias commented on Tue May 24 2022
node 5 was rebooted.
The repair failed because one of the nodes (node 5, with IP 10.142.0.189) was down.
@slivne commented on Tue May 24 2022
@igorsimb / @KnifeyMoloko asias is correct, at least in the above case - the node 10.142.0.189 restarted:
![Screenshot from 2022-05-24 15-18-05](https://user-images.githubusercontent.com/3465480/170032821-25a4045b-9587-4426-b7de-23ff994bd9af.png)
and it seems the node was restarted forcefully
reassigning this to QA