Closed: @yarongilor closed this issue 1 year ago.
@asias can you please have a look?
@yarongilor I see there is also a core file:
< t:2019-09-18 06:40:22,124 f:cluster.py l:833 c:sdcm.cluster p:DEBUG > Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-6 [52.214.107.136 | 10.0.42.134] (seed: False): Found coredump file: /var/lib/systemd/coredump/core.scylla.996.fda02d2cfd394d57ad5130894d4813bc.64535.1568788699000000
Did you open an issue for it too?
This coredump was lost due to an SCT issue, but it may be reproduced in: https://github.com/scylladb/scylla/issues/5086
The error in the log says: [shard 0] storage_proxy - Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005)
It is from the proxy server. It has nothing to do with repair, because repair won't delete any of the tables.
What operations were performed before and during the RestartThenRepairNode nemesis? Did we remove any tables during the test?
The sequences are:
- Node 10.0.42.134 runs RestartThenRepairNode
- 5 mins later
- Node 10.0.169.31 reports: Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005)
- The node 10.0.42.134 returned to be operational only after ~ 40 minutes
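Timelines like the one above keep coming up in this thread, so here is a minimal standalone sketch (not part of SCT; the log lines are abbreviated from the ones quoted below) that reconstructs the event ordering and gaps from syslog-style Scylla lines:

```python
from datetime import datetime

# Minimal sketch: reconstruct an event timeline from syslog-style lines such as
#   "2019-09-18T07:07:10+00:00 ip-10-0-42-134 !ERR | scylla: ..."
# The hosts and messages below are abbreviated from logs quoted in this issue.

def parse_event(line):
    """Split a syslog-style line into (timestamp, host, message)."""
    ts, host, rest = line.split(" ", 2)
    return datetime.fromisoformat(ts), host, rest

def timeline(lines):
    """Return events sorted by timestamp, with minutes since the first event."""
    events = sorted(parse_event(line) for line in lines)
    t0 = events[0][0]
    return [((ts - t0).total_seconds() / 60, host, msg)
            for ts, host, msg in events]

lines = [
    "2019-09-18T07:07:10+00:00 ip-10-0-169-31 !ERR | storage_proxy - Can't find a column family",
    "2019-09-18T07:05:18+00:00 ip-10-0-42-134 !INFO | scylla: Scylla starting",
    "2019-09-18T07:41:07+00:00 ip-10-0-42-134 !INFO | storage_service - Bootstrap completed!",
]

for minutes, host, msg in timeline(lines):
    print(f"+{minutes:5.1f} min  {host}  {msg}")
```

With the three sample lines above, the output makes the ~36-minute gap between the error and the end of bootstrap explicit.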
What do you mean by "the node 10.0.42.134 returned to be operational only after ~ 40 minutes."?
OK. Looking at the metrics, now I understand what "not operational" means. The node 10.0.42.134, which runs RestartThenRepairNode, serves very few reads/writes, and its CPU utilization goes up and down.
Node 10-0-42-134 restarts:
2019-09-18T07:05:09+00:00 ip-10-0-42-134 !NOTICE | kernel: Linux version 5.2.14-1.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc version 4 .8.5 20150623 (Red Hat 4.8.5-39) (GCC)) #1 SMP Tue Sep 10 10:50:19 EDT 2019
2019-09-18T07:05:18+00:00 ip-10-0-42-134 !INFO | scylla: Scylla version 3.1.0.rc7-0.20190915.024d1563a starting ...
2019-09-18T07:05:21+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] storage_service - Starting up server gossip
Node 10-0-42-134 starts to bootstrap, because it replaces a node:
2019-09-18T07:06:04+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] storage_service - JOINING: Replacing a node with token(s): {}
Node 10-0-42-134 still pulls schema:
2019-09-18T07:07:10+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] migration_manager - Requesting schema pull from 10.0.169.31:0
2019-09-18T07:07:10+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] migration_manager - Pulling schema from 10.0.169.31:0
2019-09-18T07:07:10+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] schema_tables - Creating keyspace keyspace1
2019-09-18T07:07:10+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] schema_tables - Creating keyspace keyspace_deflate
2019-09-18T07:07:10+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] schema_tables - Creating keyspace keyspace_lz4
2019-09-18T07:07:10+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] schema_tables - Creating keyspace keyspace_snappy
2019-09-18T07:07:10+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] schema_tables - Creating keyspace ks_truncate
2019-09-18T07:07:10+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] schema_tables - Creating keyspace mview
2019-09-18T07:07:10+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] schema_tables - Creating keyspace system_auth
2019-09-18T07:07:10+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] schema_tables - Creating keyspace system_distributed
2019-09-18T07:07:10+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] schema_tables - Creating keyspace system_traces
Node 10-0-42-134 finishes the replace operation:
2019-09-18T07:41:07+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] storage_service - Bootstrap completed! for the tokens {}
2019-09-18T07:41:19+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] storage_service - Starting listening for CQL clients on 0.0.0.0:9042 (unencrypted)
Node 10-0-42-134 starts repair:
2019-09-18T07:43:35+00:00 ip-10-0-42-134 !INFO | scylla: [shard 0] repair - starting user-requested repair for keyspace system_traces, repair id 1, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { primaryRange -> false}, { parallelism -> parallel}}
Node 10-0-169-31 saw the error:
2019-09-18T07:07:10+00:00 ip-10-0-169-31 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005)
Node 10-0-169-31 saw that node 10.0.42.134 was down:
2019-09-18T07:02:35+00:00 ip-10-0-169-31 !INFO | scylla: [shard 0] gossip - InetAddress 10.0.42.134 is now DOWN, status = shutdown
2019-09-18T07:05:22+00:00 ip-10-0-169-31 !INFO | scylla: [shard 0] gossip - InetAddress 10.0.42.134 is now DOWN, status = hibernate
2019-09-18T07:07:10+00:00 ip-10-0-169-31 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005)
2019-09-18T07:07:10+00:00 ip-10-0-169-31 !INFO | scylla: [shard 0] gossip - InetAddress 10.0.42.134 is now DOWN, status = hibernate
2019-09-18T07:41:08+00:00 ip-10-0-169-31 !INFO | scylla: [shard 0] storage_service - Node 10.0.42.134 state jump to normal
According to the log, node 10-0-42-134 is replacing itself.
$ cat sct_08_42_22_09_2019.log |grep replace_address|grep 42.134
< t:2019-09-18 07:02:27,420 f:remote.py l:213 c:sdcm.remote p:DEBUG > RemoteCmdRunner [centos@10.0.42.134]: Running command "sudo grep replace_address: /etc/scylla/scylla.yaml"...
< t:2019-09-18 07:02:28,972 f:remote.py l:108 c:sdcm.remote p:INFO > RemoteCmdRunner [centos@10.0.42.134]: Command "sudo grep replace_address: /etc/scylla/scylla.yaml" finished with status 1
< t:2019-09-18 07:02:28,992 f:remote.py l:213 c:sdcm.remote p:DEBUG > RemoteCmdRunner [centos@10.0.42.134]: Running command "sudo echo replace_address_first_boot: 10.0.42.134 >> /etc/scylla/scylla.yaml"...
< t:2019-09-18 07:02:31,971 f:remote.py l:108 c:sdcm.remote p:INFO > RemoteCmdRunner [centos@10.0.42.134]: Command "sudo echo replace_address_first_boot: 10.0.42.134 >> /etc/scylla/scylla.yaml" finished with status 0
This explains why node 10-0-42-134 serves zero CQL requests from 07:00 to 07:40. Because it is replacing a node during that window, it does not serve CQL reads/writes until the replace operation is done. It also explains why it hit the column family issue: at 07:07:10 the node was still in the boot-up process, learning the schema from the cluster. But I cannot explain why node 10-0-169-31 sent a storage_proxy request to node 10-0-42-134 in the first place. The replacing node 10-0-42-134 sets its gossip status to HIBERNATE, and 10-0-169-31 will not mark the replacing node as UP while it is in HIBERNATE status. Perhaps storage_proxy sends requests to 10-0-42-134 even if it is DOWN?
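To make the hypothesis above concrete, here is a hypothetical sketch (NOT Scylla's actual storage_proxy code; the statuses and function are illustrative only) of the target-selection check being discussed. A replacing node advertises gossip status HIBERNATE, so a coordinator applying a filter like this would never route a read/write to it; the fact that the error appeared suggests some path bypassed such a check:

```python
# Hypothetical sketch of replica filtering under discussion, not Scylla code.
# A replacing node gossips HIBERNATE; a stopped node gossips shutdown.
EXCLUDED_STATUSES = {"HIBERNATE", "shutdown"}

def live_targets(replicas, gossip_status):
    """Return only the replicas that are safe proxy targets under this hypothesis."""
    return [r for r in replicas
            if gossip_status.get(r) not in EXCLUDED_STATUSES]

# Cluster state at 07:07:10 as reconstructed from the logs in this thread:
status = {"10.0.42.134": "HIBERNATE", "10.0.169.31": "NORMAL"}
print(live_targets(["10.0.42.134", "10.0.169.31"], status))
# → ['10.0.169.31']
```

Under this model, 10.0.42.134 should never have received the request that produced the "Can't find a column family" error.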
@roydahan please verify this is not related to the restart node issue
The exception happened after the 'scylla_create_devices' execution, right during the scylla-server start. The repair started only 36 minutes later.
$ grep 'Got new public IP\|Waiting for SSH to be up\|Waiting for DB services to be\|netstat -ln' sct.log | grep 't:2019-09-18 07:0' &
[jenkins@aws-eu-builder3 latest]$ < t:2019-09-18 07:05:07,524 f:cluster_aws.py l:430 c:sdcm.cluster p:DEBUG > Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-6 [52.214.107.136 | 10.0.42.134] (seed: False): Got new public IP 63.32.44.228
< t:2019-09-18 07:07:11,016 f:remote.py l:112 c:sdcm.remote p:ERROR > RemoteCmdRunner [centos@10.0.42.134]: Error executing command: "netstat -ln | grep :9042"; Exit status: 1
< t:2019-09-18 07:07:11,927 f:wait.py l:48 c:sdcm.wait p:DEBUG > wait_for: Retrying Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-6 [63.32.44.228 | 10.0.42.134] (seed: False): Waiting for DB services to be up: attempt 1 ended with: False
This issue occurred in https://jenkins.scylladb.com/job/scylla-3.1/job/longevity/job/longevity-100gb-4h/95/
10-07 00:06:17,645 | sdcm.nemesis.ChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method restart_then_repair_node
10-07 00:07:35,191 | RemoteCmdRunner [centos@172.30.0.146]: Running command "sudo /usr/lib/scylla/scylla-ami/scylla_create_devices"...
10-07 00:07:38,407 [centos@172.30.0.146]: Command "sudo /usr/lib/scylla/scylla-ami/scylla_create_devices" finished with status 0
10-07 00:07:47,504 | 2019-10-07T00:07:42+00:00 ip-172-30-0-11 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 172.30.0.146: std::runtime_error (Can't find a column family with UUID 3afdcef0-e884-11e9-96bb-00000000000a)
10-07 00:07:47,517 | 2019-10-07T00:07:42+00:00 ip-172-30-0-11 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 172.30.0.146: std::runtime_error (Can't find a column family with UUID 3afdcef0-e884-11e9-96bb-00000000000a)
10-07 00:07:47,528 | 2019-10-07T00:07:42+00:00 ip-172-30-0-11 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 172.30.0.146: std::runtime_error (Can't find a column family with UUID 3afdcef0-e884-11e9-96bb-00000000000a)
10-07 00:10:41,215 | RemoteCmdRunner [centos@172.30.0.146]: Running command "nodetool repair "...
10-07 00:11:49,433 | RemoteCmdRunner [centos@172.30.0.146]: Command "nodetool repair " finished with status 0
10-07 00:11:49,433 | sdcm.nemesis.ChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method restart_then_repair_node
Pushing to 3.2 - once fixed, consider backports.
@yarongilor / @roydahan / @amoskong does this happen in 3.2 as well - if not we will close
I didn't see anyone reporting, I think we can close.
@yarongilor / @roydahan / @amoskong does this happen in 3.2 as well - if not we will close
I didn't see this issue with 3.2
Reproduced in https://jenkins.scylladb.com/job/scylla-3.3/job/longevity/job/longevity-mv-si-4days/6
Scylla version: 3.3.rc1-0.20200223.756574d0946
Get logs: hydra investigate show-logs efda4b93-c29c-4248-a4c8-790923c084bf
The errors occurred after a node restart, during bootstrap, when compaction and streaming ran in parallel.
Also, semaphore_timed_out errors are reported at the same time:
< t:2020-02-28 07:07:25,158 f:cluster.py l:1368 c:sdcm.cluster p:DEBUG > 2020-02-28T06:59:11+00:00 longevity-mv-si-4d-3-3-db-node-efda4b93-9 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.181.73: std::runtime_error (Can't find a column family with UUID 3cc8e850-57dd-11ea-9e81-00000000000c)
< t:2020-02-28 07:07:25,169 f:cluster.py l:1368 c:sdcm.cluster p:DEBUG > 2020-02-28T06:16:05+00:00 longevity-mv-si-4d-3-3-db-node-efda4b93-10 !WARNING | scylla: [shard 1] storage_proxy - Failed to apply mutation from 10.0.115.61#1: seastar::semaphore_timed_out (Semaphore timedout)
< t:2020-02-28 07:07:25,171 f:cluster.py l:1368 c:sdcm.cluster p:DEBUG > 2020-02-28T06:59:11+00:00 longevity-mv-si-4d-3-3-db-node-efda4b93-9 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.181.73: std::runtime_error (Can't find a column family with UUID 3cc8e850-57dd-11ea-9e81-00000000000c)
< t:2020-02-28 07:07:25,178 f:cluster.py l:1368 c:sdcm.cluster p:DEBUG > 2020-02-28T06:16:05+00:00 longevity-mv-si-4d-3-3-db-node-efda4b93-10 !WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from 10.0.140.201#4: seastar::semaphore_timed_out (Semaphore timedout)
Repair was started later. The "Can't find a column family with UUID" error does not occur during repair.
@asias can you check whether this issue is somehow related to repair running? I think we changed the code to validate the schema when repair starts (I think it's part of 3.3).
We assume this is an issue with stopping/starting an AWS node and mounting the device ...
We will create a separate issue for this.
Pushing this out to 3.4 to be verified with the fix
@juliayakovlev please verify if the issue happened BEFORE or AFTER the scylla_create_devices. If it's BEFORE and the second start was ok, it's not an issue.
reported https://github.com/scylladb/scylla/issues/6012 to stop having these errors.
scylla_create_devices was started before the errors and finished successfully:
< t:2020-02-28 06:56:57,794 f:nemesis.py l:390 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.ChaosMonkey: Set current_disruption -> RestartThenRepairNode Node longevity-mv-si-4d-3-3-db-node-efda4b93-10 [34.243.145.189 | 10.0.181.73] (seed: False)
< t:2020-02-28 06:59:02,209 f:remote.py l:286 c:sdcm.remote p:DEBUG > RemoteCmdRunner [centos@10.0.181.73]: Running command "sudo test -e /usr/lib/scylla/scylla-ami/scylla_create_devices"...
< t:2020-02-28 06:59:02,667 f:remote.py l:286 c:sdcm.remote p:DEBUG > RemoteCmdRunner [centos@10.0.181.73]: Running command "sudo test -e /opt/scylladb/scylla-ami/scylla_create_devices"...
< t:2020-02-28 06:59:03,585 f:remote.py l:152 c:sdcm.remote p:INFO > RemoteCmdRunner [centos@10.0.181.73]: Command "sudo test -e /opt/scylladb/scylla-ami/scylla_create_devices" finished with status 0
< t:2020-02-28 06:59:03,589 f:remote.py l:286 c:sdcm.remote p:DEBUG > RemoteCmdRunner [centos@10.0.181.73]: Running command "sudo /opt/scylladb/scylla-ami/scylla_create_devices"...
< t:2020-02-28 06:59:09,076 f:remote.py l:152 c:sdcm.remote p:INFO > RemoteCmdRunner [centos@10.0.181.73]: Command "sudo /opt/scylladb/scylla-ami/scylla_create_devices" finished with status 0
< t:2020-02-28 07:07:25,158 f:cluster.py l:1368 c:sdcm.cluster p:DEBUG > 2020-02-28T06:59:11+00:00 longevity-mv-si-4d-3-3-db-node-efda4b93-9 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.181.73: std::runtime_error (Can't find a column family with UUID 3cc8e850-57dd-11ea-9e81-00000000000c)
< t:2020-02-28 07:07:25,171 f:cluster.py l:1368 c:sdcm.cluster p:DEBUG > 2020-02-28T06:59:11+00:00 longevity-mv-si-4d-3-3-db-node-efda4b93-9 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.181.73: std::runtime_error (Can't find a column family with UUID 3cc8e850-57dd-11ea-9e81-00000000000c)
< t:2020-02-28 07:07:25,180 f:cluster.py l:1368 c:sdcm.cluster p:DEBUG > 2020-02-28T06:59:11+00:00 longevity-mv-si-4d-3-3-db-node-efda4b93-9 !ERR | scylla: [shard 7] storage_proxy - Exception when communicating with 10.0.181.73: std::runtime_error (Can't find a column family with UUID 3cc8e850-57dd-11ea-9e81-00000000000c)
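The ordering in the logs above can be verified directly from the quoted timestamps (a minimal sketch; timestamps are rounded to the second and taken verbatim from the lines in this thread):

```python
from datetime import datetime

# scylla_create_devices finished at 06:59:09 (SCT log) and the first
# storage_proxy error was logged at 06:59:11 (scylla log), so the errors
# began AFTER scylla_create_devices completed, during the server start.
create_devices_done = datetime.fromisoformat("2020-02-28T06:59:09+00:00")
first_error = datetime.fromisoformat("2020-02-28T06:59:11+00:00")

delta = (first_error - create_devices_done).total_seconds()
print(f"first error {delta:.0f}s after scylla_create_devices finished")
# → first error 2s after scylla_create_devices finished
```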
@roydahan / @juliayakovlev are we still hitting this issue in master / 4.0 - if not please close
@fgelcer please report if you see this on one of the longevity tests of 4.0. Please note it will run only on longevity that isn’t using spot instances.
I haven't seen this issue on 4.0... I will grep some logs now to check if this message happens, but I'm confident it doesn't.
Scylla version 4.5.dev-0.20210118.faf71c6f7 with build-id 05403082ab7eef6e982494d451f9620bf8293f58
I'm experiencing something very similar to this issue, during the RestartThenRepairNode nemesis:
[2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,250 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T10:02:36+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,254 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T10:02:36+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,255 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T10:02:36+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,256 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T10:02:36+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,258 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T10:02:36+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T16:49:19.098Z] < t:2021-01-22 16:49:15,832 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T16:49:03+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-8 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T16:49:19.098Z] < t:2021-01-22 16:49:15,836 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T16:49:13+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-8 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T16:50:01.925Z] < t:2021-01-22 16:50:00,838 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T16:49:54+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-8 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,816 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,819 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,824 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,826 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_zstd.standard1: std::runtime_error (Can't find a column family with UUID 9fb99d90-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,827 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: std::runtime_error (Can't find a column family with UUID 7637da40-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,829 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_zstd.standard1: std::runtime_error (Can't find a column family with UUID 9fb99d90-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,830 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: std::runtime_error (Can't find a column family with UUID 7637da40-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,832 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,834 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_zstd.standard1: std::runtime_error (Can't find a column family with UUID 9fb99d90-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,836 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,837 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,839 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_snappy.standard1: std::runtime_error (Can't find a column family with UUID b859efd0-5c6b-11eb-b111-000000000008)
In this current case, this is happening also on an MV, but with the same description as the originally reported one. In total, we had 81K ERROR events in this test, so I retrieved part of the logs manually.
Also, in another run of the same nemesis (RestartThenRepairNode) we had backtraces too:
[2021-01-23T12:37:22.291Z] < t:2021-01-23 12:37:21,706 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-23T12:27:03+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-3 !WARNING | scylla: [shard 5] seastar - Exceptional future ignored: exceptions::mutation_write_timeout_exception (Operation timed out for mview.users_by_last_name - received only 0 responses from 1 CL=ANY.), backtrace: 0x3c73dae#012 0x3c74220#012 0x3c745a8#012 0x389f627#012 0x2492cbc#012 0x38ea26f#012 0x38eb457#012 0x3909b18#012 0x38b5bba#012 /opt/scylladb/libreloc/libpthread.so.0+0x93f8#012 /opt/scylladb/libreloc/libc.so.6+0x101902#012 --------#012 seastar::continuation<seastar::internal::promise_base_with_type<void>, db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14>(db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, 
db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
--
I succeeded in collecting part of the logs, and they are here.
Scylla version 4.5.dev-0.20210118.faf71c6f7 with build-id 05403082ab7eef6e982494d451f9620bf8293f58
test_id = 6445e5cb-3357-43ff-9389-9bdb729d9f64
i'm experiencing something very similar with this issue, duringRestartThenRepairNode
nemesis:[2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,250 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T10:02:36+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem) [2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,254 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T10:02:36+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem) [2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,255 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T10:02:36+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem) [2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,256 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T10:02:36+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem) [2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,258 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T10:02:36+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem) [2021-01-22T16:49:19.098Z] < 
t:2021-01-22 16:49:15,832 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T16:49:03+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-8 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem) [2021-01-22T16:49:19.098Z] < t:2021-01-22 16:49:15,836 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T16:49:13+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-8 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem) [2021-01-22T16:50:01.925Z] < t:2021-01-22 16:50:00,838 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T16:49:54+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-8 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem) [2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,816 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003) [2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,819 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003) [2021-01-22T17:06:29.619Z] < 
t:2021-01-22 17:06:28,824 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,826 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_zstd.standard1: std::runtime_error (Can't find a column family with UUID 9fb99d90-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,827 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: std::runtime_error (Can't find a column family with UUID 7637da40-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,829 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_zstd.standard1: std::runtime_error (Can't find a column family with UUID 9fb99d90-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,830 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: std::runtime_error (Can't find a column family with UUID 7637da40-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,832 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,834 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_zstd.standard1: std::runtime_error (Can't find a column family with UUID 9fb99d90-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,836 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,837 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,839 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-22T17:06:27+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_snappy.standard1: std::runtime_error (Can't find a column family with UUID b859efd0-5c6b-11eb-b111-000000000008)
In the current case this is happening also on an MV, but with the same description as the originally reported one. In total we had 81K ERROR events in this test, so I manually retrieved part of the logs.
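With that many repeated ERROR events, it helps to collapse them into per-table counts before reading further. The sketch below is a hypothetical triage helper (not part of SCT); the regex and the sample lines are modeled on the log excerpt above.

```python
import re
from collections import Counter

# Hypothetical helper (not part of SCT): collapse thousands of repeated
# "Can't find a column family" errors into per-(peer, table, UUID) counts
# so a run with 81K ERROR events becomes reviewable.
ERR_RE = re.compile(
    r"Exception when communicating with (?P<peer>[\d.]+)"
    r"(?:, to read from (?P<table>[\w.]+))?"
    r".*Can't find a column family with UUID (?P<uuid>[0-9a-f-]+)"
)

def summarize(lines):
    counts = Counter()
    for line in lines:
        m = ERR_RE.search(line)
        if m:
            counts[(m.group("peer"), m.group("table"), m.group("uuid"))] += 1
    return counts

# Sample lines modeled on the excerpt above.
sample = [
    "!ERR | scylla: [shard 0] storage_proxy - Exception when communicating with "
    "10.0.3.245, to read from mview.users: std::runtime_error "
    "(Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)",
    "!ERR | scylla: [shard 0] storage_proxy - Exception when communicating with "
    "10.0.3.245, to read from keyspace1.standard1: std::runtime_error "
    "(Can't find a column family with UUID 7637da40-5c6b-11eb-b111-000000000008)",
    "!ERR | scylla: [shard 0] storage_proxy - Exception when communicating with "
    "10.0.3.245, to read from mview.users: std::runtime_error "
    "(Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)",
]

for (peer, table, uuid), n in summarize(sample).most_common():
    print(f"{n:>6}  {peer}  {table}  {uuid}")
```

Feeding it the real db-cluster logs would show at a glance whether all 81K events point at the same handful of dropped tables.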
Also, in another run of the same nemesis (RestartThenRepairNode) we had backtraces too:
[2021-01-23T12:37:22.291Z] < t:2021-01-23 12:37:21,706 f:file_logger.py l:78 c:sdcm.sct_events.file_logger p:INFO > 2021-01-23T12:27:03+00:00 longevity-tls-50gb-4d-master-db-node-6445e5cb-3 !WARNING | scylla: [shard 5] seastar - Exceptional future ignored: exceptions::mutation_write_timeout_exception (Operation timed out for mview.users_by_last_name - received only 0 responses from 1 CL=ANY.), backtrace: 0x3c73dae#012 0x3c74220#012 0x3c745a8#012 0x389f627#012 0x2492cbc#012 0x38ea26f#012 0x38eb457#012 0x3909b18#012 0x38b5bba#012 /opt/scylladb/libreloc/libpthread.so.0+0x93f8#012 /opt/scylladb/libreloc/libc.so.6+0x101902#012 --------#012 seastar::continuation<seastar::internal::promise_base_with_type<void>, db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14>(db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, 
seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14&, seastar::future_state<seastar::internal::monostate>&&)#1}, void> --
I managed to collect part of the logs, and they are here
@fgelcer please check if it happened before / after the "scylla_create_devices" and if after, how long after. (Need to make sure it's not when scylla try to start while it's missing the disks).
@roydahan, I can't say for sure now, as this is the run where the logs were lost because of ENOSPC in the runner, but I believe it was after the devices were re-created.
A similar issue was seen in scylla-enterprise, branch-2020.1.9.
During the RestartThenRepairNode nemesis, the node longevity-100gb-4h-2020-1-db-node-17f31ab9-4 [34.247.69.27 | 10.0.36.116] was restarted (stop and start):
< t:2021-06-16 11:44:26,029 f:sct_events.py l:906 c:sdcm.sct_events p:INFO > 2021-06-16 11:44:26.018: (DisruptionEvent Severity.NORMAL): type=start name=RestartThenRepairNode node=Node longevity-100gb-4h-2020-1-db-node-17f31ab9-4 [34.247.69.27 | 10.0.36.116] (seed: False) duration=None
and while node 4 was starting, another node (longevity-100gb-4h-2020-1-db-node-17f31ab9-3) reported two errors:
< t:2021-06-16 11:46:14,169 f:cluster.py l:1547 c:sdcm.cluster p:DEBUG > 2021-06-16T11:46:00+00:00 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.36.116: std::runtime_error (Can't find a column family with UUID ce510530-ce94-11eb-bd35-000000000005)
< t:2021-06-16 11:46:14,183 f:sct_events.py l:906 c:sdcm.sct_events p:INFO > 2021-06-16T11:46:00+00:00 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.36.116: std::runtime_error (Can't find a column family with UUID ce510530-ce94-11eb-bd35-000000000005)
From node 3:
Jun 16 11:45:50 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]: [shard 0] rpc - client 10.0.36.116:7000: fail to connect: Connection refused
Jun 16 11:45:50 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]: [shard 0] rpc - client 10.0.36.116:7000: fail to connect: Connection refused
Jun 16 11:45:53 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]: [shard 0] rpc - client 10.0.36.116:7000: fail to connect: Connection refused
Jun 16 11:45:58 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]: [shard 0] rpc - client 10.0.36.116:7000: fail to connect: Connection refused
Jun 16 11:46:00 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]: [shard 0] gossip - InetAddress 10.0.36.116 is now UP, status = hibernate
Jun 16 11:46:00 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]: [shard 0] features - range_tombstones: 1 xxhash: 1 in_memory: 1 result: 1
Jun 16 11:46:00 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]: [shard 0] storage_proxy - Exception when communicating with 10.0.36.116: std::runtime_error (Can't find a column family with UUID ce510530-ce94-11eb-bd35-000000000005)
Because this issue is not being fixed and similar issues keep being opened, I decided to disable this nemesis for now:
@asias I presume repair builds a list of tables to go over at start, so if a table is dropped while repair is running you'll get an exception like this one. As such, this event is not a bug; such a table should simply be skipped by repair. After all, there may be new tables that appear while repair is running, and they will not be repaired either. So if a table was dropped, it's OK to skip it as well.
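The suggested semantics can be sketched as follows. This is an illustrative Python sketch of the proposed behavior, not Scylla's actual (C++) repair code; the names `repair`, `snapshot`, and `live_tables` are all hypothetical.

```python
# Illustrative sketch (not Scylla's actual implementation) of the behavior
# suggested above: repair captures the table list once at start, and any
# table dropped mid-run is skipped rather than failing the whole repair.

def repair(snapshot, live_tables, log):
    repaired = []
    for table in snapshot:              # list captured when repair started
        if table not in live_tables:
            # Table was dropped after the snapshot was taken: skip it,
            # just as tables created after the snapshot are not repaired.
            log.append(f"skipping dropped table {table}")
            continue
        repaired.append(table)          # placeholder for the real repair work
    return repaired

log = []
snapshot = ["keyspace1.standard1", "mview.users", "keyspace_zstd.standard1"]
live = {"keyspace1.standard1", "keyspace_zstd.standard1"}  # mview.users dropped mid-repair
print(repair(snapshot, live, log))
# → ['keyspace1.standard1', 'keyspace_zstd.standard1']
```

The key point is the `continue` on a missing table: the dropped table is logged and skipped instead of raising the `Can't find a column family` error up through storage_proxy.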
@fgelcer does repair stop on this exception? @asias can we silence the exception down?
Recently some fixes went into this area: d819d98e78e1a3b861a17b7b49abf112b4b7a327 for example
@yarongilor with https://github.com/scylladb/scylladb/commit/d819d98e78e1a3b861a17b7b49abf112b4b7a327 the error should be gone - the exception is silenced. I believe you should re-try the test.
Actually that patch fixed a re-surfacing of the problem, it was fixed earlier.
Test passed ok: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-100gb-4h-test-restart-then-repair/4/ test-id: a864dc2a-a293-4212-ba17-b3f2f0fbe4f0
Rerunning in order to cover more of the nemesis occurrences: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-100gb-4h-test-restart-then-repair/5/
@fgelcer , the issue is not reproduced on above tests and the nemesis passed OK a few more times.
@yarongilor , please send a PR to re-enable the nemesis on master
Installation details
Scylla version (or git commit hash): 3.1.0.rc7-0.20190915.024d1563a
Cluster size: 6
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0a01b94dffe4bcc0d
Scenario: 50GB longevity. Ran a RestartThenRepairNode nemesis on 10.0.42.134. 5 minutes later, node 10.0.169.31 reported an exception: Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005). The node 10.0.42.134 became operational again only after ~40 minutes.
failure event:
nemesis scenario and failure:
logs:
https://cloudius-jenkins-test.s3.amazonaws.com/5979c9a0-a1c1-4601-a2c3-63f1850766d9/longevity-tls-50gb-4d-3-1-loader-set-5979c9a0.zip
https://cloudius-jenkins-test.s3.amazonaws.com/5979c9a0-a1c1-4601-a2c3-63f1850766d9/longevity-tls-50gb-4d-3-1-monitor-set-5979c9a0.zip
https://cloudius-jenkins-test.s3.amazonaws.com/5979c9a0-a1c1-4601-a2c3-63f1850766d9/monitoring_data_stack_branch-2.4_3.1.tar.gz
https://cloudius-jenkins-test.s3.amazonaws.com/5979c9a0-a1c1-4601-a2c3-63f1850766d9/longevity-tls-50gb-4d-3-1-db-cluster-5979c9a0.zip
job log: scratch.scylladb.com/qa_issues/sct_08_42_22_09_2019.log.gz
monitor: http://52.212.4.91:3000
snapshot: https://snapshot.raintank.io/dashboard/snapshot/rbPvId9gnqV4vX7vsFF4HkGg22c55zbh