OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/4317592650001557950 (gce: us-east1)
Test: longevity-10gb-3h-gce-test
Test id: e329d26e-9900-4665-93c8-f9d0f78a3657
Test name: enterprise-2022.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):
This scenario's cluster is a single data center with 6 nodes.
At 2022-05-17 15:44:42.553 the Nemesis AddRemoveDc started. It creates a new node in a new data center, changes the replication strategy to NetworkTopologyStrategy, and rebuilds the new node using the "nodetool rebuild" command. It then runs a full cluster repair on each node. Afterwards, it is supposed to decommission the node.
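For reference, the nemesis steps roughly correspond to the sequence below. This is a dry-run sketch only: the keyspace name, DC names, and replication factors are placeholders, and the real nemesis drives this through SCT rather than a shell script.

```shell
#!/bin/sh
# Dry-run sketch of the AddRemoveDc nemesis steps.
# `run` only echoes each command, so the sequence can be inspected
# without a live cluster.
run() { echo "+ $*"; }

KS=keyspace1   # placeholder keyspace name
NEW_DC=dc2     # placeholder name of the newly added data center

# 1. Switch the keyspace to NetworkTopologyStrategy so it can
#    replicate to the new DC (placeholder replication factors).
run cqlsh -e "ALTER KEYSPACE $KS WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, '$NEW_DC': 1}"

# 2. Populate the new node with data streamed from the existing DC.
run nodetool rebuild -- dc1

# 3. Full cluster repair, primary ranges only (run on each node).
run nodetool repair -pr

# 4. Finally, remove the extra node again.
run nodetool decommission
```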
The repairs of the first 6 nodes from the original DC went fine; however, the repair of the 7th node failed:
Command: '/usr/bin/nodetool repair -pr '
Exit code: 2
Stdout:
[2022-05-17 15:51:52,511] Starting repair command #1, repairing 1 ranges for keyspace system_traces (parallelism=SEQUENTIAL, full=true)
[2022-05-17 15:52:14,612] Repair session 1 failed
[2022-05-17 15:52:14,613] Repair session 1 finished
Stderr:
error: Repair job has failed with the error message: [2022-05-17 15:52:14,612] Repair session 1 failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2022-05-17 15:52:14,612] Repair session 1 failed
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:124)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
From the log of node 7:
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! WARNING | [shard 12] repair - repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 12 failed - 80 out of 256 ranges failed
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! INFO | [shard 0] repair - repair[bc5599f3-58ea-4629-b1c8-bf9bc6a05770]: Started to shutdown off-strategy compaction updater
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! INFO | [shard 0] repair - repair[bc5599f3-58ea-4629-b1c8-bf9bc6a05770]: Finished to shutdown off-strategy compaction updater
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! WARNING | [shard 0] repair - repair_tracker run for repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] failed: std::runtime_error ({shard 0: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 0 failed to repair 80 out of 256 ranges), shard 1: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 1 failed to repair 80 out of 256 ranges), shard 2: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 2 failed to repair 80 out of 256 ranges), shard 3: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 3 failed to repair 80 out of 256 ranges), shard 4: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 4 failed to repair 80 out of 256 ranges), shard 5: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 5 failed to repair 80 out of 256 ranges), shard 6: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 6 failed to repair 80 out of 256 ranges), shard 7: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 7 failed to repair 80 out of 256 ranges), shard 8: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 8 failed to repair 80 out of 256 ranges), shard 9: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 9 failed to repair 80 out of 256 ranges), shard 10: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 10 failed to repair 80 out of 256 ranges), shard 11: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 11 failed to repair 80 out of 256 ranges), shard 12: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 12 failed to repair 80 out of 256 ranges), shard 13: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 13 failed to repair 80 out of 256 ranges), shard 14: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 14 failed to repair 80 out of 256 ranges), shard 15: std::runtime_error (repair id [id=1, uuid=bc5599f3-58ea-4629-b1c8-bf9bc6a05770] on shard 15 failed to repair 80 out of 256 ranges)})
I didn't find any concrete explanation of why the repair failed in node 7's logs.
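When the failing node's own log is inconclusive, one way to dig further is to grep the other nodes' logs for gossip state changes around the failure timestamp. A self-contained sketch, using two log lines taken verbatim from this run (the file path and patterns are illustrative):

```shell
# Sample of messages.log lines from this run (node 1).
cat > /tmp/messages.sample.log <<'EOF'
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-1 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-1 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
EOF

# Find nodes marked DOWN around the time the repair failed (15:52),
# and extract just the affected address.
grep DOWN /tmp/messages.sample.log | grep 15:52 \
  | grep -o 'InetAddress [0-9.]* is now DOWN'
# -> InetAddress 10.142.0.189 is now DOWN
```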
OS / Image: ami-0f0e4c1a732cd9815 (aws: eu-north-1)
Test: longevity-twcs-48h-test
Test id: 3c379dfb-8c1b-4a2f-895f-510272bef66c
Test name: scylla-master/longevity/longevity-twcs-48h-test
Test config file(s):
This happened while running scylla-bench stress, during the DestroyDataThenRepair nemesis, which corrupts some data, restarts the target node, and then triggers a repair operation on the target node.
From sct.log:
t:2022-05-10 22:55:16,885 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-05-10T22:55:16+00:00 longevity-twcs-48h-master-db-node-3c379dfb-5 ! INFO | [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: Finished to shutdown off-strategy compaction updater
< t:2022-05-10 22:55:16,886 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-05-10T22:55:16+00:00 longevity-twcs-48h-master-db-node-3c379dfb-5 ! WARNING | [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: repair_tracker run failed: std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is closed), shard 1: seastar::rpc::closed_error (connection is closed), shard 2: seastar::rpc::closed_error (connection is closed), shard 3: seastar::rpc::closed_error (connection is closed), shard 4: seastar::rpc::closed_error (connection is closed), shard 5: seastar::rpc::closed_error (connection is closed), shard 6: seastar::rpc::closed_error (connection is closed), shard 7: seastar::rpc::closed_error (connection is closed)})
< t:2022-05-10 22:55:16,886 f:events_processes.py l:146 c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20220510-135454-661869,id=0x7f2a355f08e0,default=True]
< t:2022-05-10 22:55:16,887 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:DEBUG > 2022-05-10 22:55:16.886 <2022-05-10 22:55:16.000>: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=e6583509-6746-4932-935c-6e829a8ccc71: type=WARNING regex=!\s*?WARNING line_number=152616 node=longevity-twcs-48h-master-db-node-3c379dfb-5
< t:2022-05-10 22:55:16,887 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:DEBUG > 2022-05-10T22:55:16+00:00 longevity-twcs-48h-master-db-node-3c379dfb-5 ! WARNING | [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: repair_tracker run failed: std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is closed), shard 1: seastar::rpc::closed_error (connection is closed), shard 2: seastar::rpc::closed_error (connection is closed), shard 3: seastar::rpc::closed_error (connection is closed), shard 4: seastar::rpc::closed_error (connection is closed), shard 5: seastar::rpc::closed_error (connection is closed), shard 6: seastar::rpc::closed_error (connection is closed), shard 7: seastar::rpc::closed_error (connection is closed)})
On the target node (node-5):
May 10 22:52:49 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 2] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: Repair 773 out of 773 ranges, shard=2, keyspace=scylla_bench, table={test, test_counters}, range=(9111342517421337944, 9134267295213569042], peers={10.0.0.5, 10.0.0.207}, live_peers={10.0.0.5, 10.0.0.207}
May 10 22:53:12 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 2] large_data - Writing large partition scylla_bench/test: 1108101562368 (66408933 bytes) to me-70778-big-Data.db
May 10 22:54:39 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 2] large_data - Writing large partition scylla_bench/test: 1112396529664 (350621549 bytes) to me-70738-big-Data.db
May 10 22:55:15 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 2] large_data - Writing large partition scylla_bench/test: 1541893259264 (350680199 bytes) to me-70786-big-Data.db
May 10 22:55:16 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: Started to shutdown off-strategy compaction updater
May 10 22:55:16 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: Finished to shutdown off-strategy compaction updater
May 10 22:55:16 longevity-twcs-48h-master-db-node-3c379dfb-5 scylla[4507]: [shard 0] repair - repair[e8c981f3-6292-425a-bdc8-0f3b2904f742]: repair_tracker run failed: std::runtime_error ({shard 0: seastar::rpc::closed_error (connection is closed), shard 1: seastar::rpc::closed_error (connection is closed), shard 2: seastar::rpc::closed_error (connection is closed), shard 3: seastar::rpc::closed_error (connection is closed), shard 4: seastar::rpc::closed_error (connection is closed), shard 5: seastar::rpc::closed_error (connection is closed), shard 6: seastar::rpc::closed_error (connection is closed), shard 7: seastar::rpc::closed_error (connection is closed)})
Back in the first issue (e329d26e), the log of node 5 shows the node rebooting right after the repair failed:
2022-05-17T15:51:37+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | [shard 13] repair - Repair 43 out of 256 ranges, id=[id=3, uuid=c1f9f7c7-3135-45fb-8d2a-133b874bbc18], shard=13, keyspace=system_traces, table={node_slow_log_time_idx, sessions, sessions_time_idx, node_slow_log, events}, range=(-6272472925117774399, -6268556193531332347], peers={10.142.0.144, 10.142.0.44}, live_peers={10.142.0.144, 10.142.0.44}
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! NOTICE | Linux version 5.13.0-1024-gcp (buildd@lcy02-amd64-110) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #29~20.04.1-Ubuntu SMP Thu Apr 14 23:15:00 UTC 2022 (Ubuntu 5.13.0-1024.29~20.04.1-gcp 5.13.19)
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | Command line: BOOT_IMAGE=/boot/vmlinuz-5.13.0-1024-gcp root=PARTUUID=0ad9b47d-8aac-4376-896f-3c4a6893f6e5 ro console=ttyS0 net.ifnames=0 clocksource=tsc tsc=reliable panic=-1
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | KERNEL supported cpus:
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | Intel GenuineIntel
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | AMD AuthenticAMD
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | Hygon HygonGenuine
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | Centaur CentaurHauls
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | zhaoxin Shanghai
2022-05-17T15:52:58+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-5 ! INFO | x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point
The repair failed because one of the nodes (node 5, with IP 10.142.0.189) was down:
$ cat longevity-10gb-3h-2022-1-db-node-e329d26e-0-*/messages.log |grep DOWN|grep 15:52
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-1 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-1 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-2 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-2 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-3 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-3 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-4 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-4 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-4 ! INFO | [shard 3] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-4 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-6 ! INFO | [shard 4] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:14+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-6 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! INFO | [shard 4] gossip - failure_detector_loop: Mark node 10.142.0.189 as DOWN
2022-05-17T15:52:13+00:00 longevity-10gb-3h-2022-1-db-node-e329d26e-0-7 ! INFO | [shard 0] gossip - InetAddress 10.142.0.189 is now DOWN, status = NORMAL
@igorsimb commented on Mon May 23 2022
Installation details
Kernel Version: 5.13.0-1024-gcp
Scylla version (or git commit hash): 2022.1~rc5-20220515.6a1e89fbb with build-id 5cecadda59974548befb4305363bf374631fc3e1
Cluster size: 6 nodes (n1-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/4317592650001557950 (gce: us-east1)
Test: longevity-10gb-3h-gce-test
Test id: e329d26e-9900-4665-93c8-f9d0f78a3657
Test name: enterprise-2022.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):
$ hydra investigate show-monitor e329d26e-9900-4665-93c8-f9d0f78a3657
$ hydra investigate show-logs e329d26e-9900-4665-93c8-f9d0f78a3657
Logs:
Jenkins job URL
@KnifeyMoloko commented on Mon May 23 2022
Similar failure in:
Installation details
Kernel Version: 5.13.0-1022-aws
Scylla version (or git commit hash): 5.1.dev-20220504.b26a3da584cc with build-id ab2a33a30756c1513f4c516cd272291e75acec0e
Cluster size: 4 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0f0e4c1a732cd9815 (aws: eu-north-1)
Test: longevity-twcs-48h-test
Test id: 3c379dfb-8c1b-4a2f-895f-510272bef66c
Test name: scylla-master/longevity/longevity-twcs-48h-test
Test config file(s):
$ hydra investigate show-monitor 3c379dfb-8c1b-4a2f-895f-510272bef66c
$ hydra investigate show-logs 3c379dfb-8c1b-4a2f-895f-510272bef66c
Logs:
Jenkins job URL
@asias commented on Tue May 24 2022
node 5 was rebooted.
The repair failed because one of the nodes (node 5, with IP 10.142.0.189) was down.
@slivne commented on Tue May 24 2022
@igorsimb / @KnifeyMoloko asias is correct, at least in the above case - the node 10.142.0.189 restarted:
![Screenshot from 2022-05-24 15-18-05](https://user-images.githubusercontent.com/3465480/170032821-25a4045b-9587-4426-b7de-23ff994bd9af.png)
and it seems the node was restarted forcefully
reassigning this to QA