scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

storage_proxy - Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005) #5080

Closed yarongilor closed 1 year ago

yarongilor commented 5 years ago

Installation details
Scylla version (or git commit hash): 3.1.0.rc7-0.20190915.024d1563a
Cluster size: 6
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0a01b94dffe4bcc0d

Scenario: 50 GB longevity test. Ran a RestartThenRepairNode nemesis on 10.0.42.134. Five minutes later, node 10.0.169.31 reported an exception: Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005). Node 10.0.42.134 returned to an operational state only after ~40 minutes.

failure event:

(DatabaseLogEvent Severity.CRITICAL): type=DATABASE_ERROR regex=Exception  line_number=321657 node=Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-1 [54.154.21.39 | 10.0.169.31] (seed: True)
2019-09-18T07:07:10+00:00  ip-10-0-169-31 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005)

nemesis scenario and failure:

[jenkins@aws-eu-builder3 latest]$ grep -v 'ChaosMonkey on target' sct_9-18.log | grep -i 'found coredump\|Set current_disruption\|segmentation\|<<<<<<<<<\|p:ERROR > Traceback\|find a column family with UUID' 
< t:2019-09-18 02:55:23,098 f:nemesis.py      l:595  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.ChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method destroy_data_then_repair
< t:2019-09-18 03:15:02,621 f:nemesis.py      l:595  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.ChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
< t:2019-09-18 03:34:41,668 f:nemesis.py      l:595  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.ChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method nodetool_enospc
< t:2019-09-18 03:53:38,492 f:nemesis.py      l:338  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.ChaosMonkey: Set current_disruption -> ModifyTablePropertiesDclocalReadRepairChance Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-5 [34.248.223.143 | 10.0.63.76] (seed: False)
< t:2019-09-18 03:54:08,948 f:nemesis.py      l:595  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.ChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method modify_table
< t:2019-09-18 04:13:20,674 f:nemesis.py      l:338  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.ChaosMonkey: Set current_disruption -> TruncateMonkey Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-3 [34.247.162.93 | 10.0.69.237] (seed: False)
< t:2019-09-18 04:15:29,804 f:nemesis.py      l:1071 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2019-09-18 04:35:06,021 f:nemesis.py      l:338  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.ChaosMonkey: Set current_disruption -> ModifyTablePropertiesCompaction Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-5 [34.248.223.143 | 10.0.63.76] (seed: False)
< t:2019-09-18 04:35:34,818 f:nemesis.py      l:595  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.ChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method modify_table
< t:2019-09-18 04:54:58,207 f:nemesis.py      l:338  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.ChaosMonkey: Set current_disruption -> TerminateAndReplaceNode Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-4 [34.240.105.72 | 10.0.159.212] (seed: False)
< t:2019-09-18 06:40:22,124 f:cluster.py      l:833  c:sdcm.cluster         p:DEBUG > Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-6 [52.214.107.136 | 10.0.42.134] (seed: False): Found coredump file: /var/lib/systemd/coredump/core.scylla.996.fda02d2cfd394d57ad5130894d4813bc.64535.1568788699000000
< t:2019-09-18 06:45:46,572 f:nemesis.py      l:1071 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2019-09-18 07:02:23,258 f:nemesis.py      l:338  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.ChaosMonkey: Set current_disruption -> RestartThenRepairNode Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-6 [52.214.107.136 | 10.0.42.134] (seed: False)
< t:2019-09-18 12:07:47,684 f:nemesis.py      l:595  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.ChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method restart_then_repair_node
< t:2019-09-18 12:24:15,634 f:nemesis.py      l:338  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.ChaosMonkey: Set current_disruption -> RestartNode with resharding Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-3 [34.247.162.93 | 10.0.69.237] (seed: False)
< t:2019-09-18 12:31:04,446 f:nemesis.py      l:1071 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2019-09-18 12:44:33,005 f:nemesis.py      l:338  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.ChaosMonkey: Set current_disruption -> RestartThenRepairNode Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-3 [34.247.162.93 | 10.0.69.237] (seed: False)
< t:2019-09-18 19:27:11,162 f:cluster.py      l:1231 c:sdcm.cluster         p:DEBUG > 2019-09-18T07:07:10+00:00  ip-10-0-169-31 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005)

logs:
https://cloudius-jenkins-test.s3.amazonaws.com/5979c9a0-a1c1-4601-a2c3-63f1850766d9/longevity-tls-50gb-4d-3-1-loader-set-5979c9a0.zip
https://cloudius-jenkins-test.s3.amazonaws.com/5979c9a0-a1c1-4601-a2c3-63f1850766d9/longevity-tls-50gb-4d-3-1-monitor-set-5979c9a0.zip
https://cloudius-jenkins-test.s3.amazonaws.com/5979c9a0-a1c1-4601-a2c3-63f1850766d9/monitoring_data_stack_branch-2.4_3.1.tar.gz
https://cloudius-jenkins-test.s3.amazonaws.com/5979c9a0-a1c1-4601-a2c3-63f1850766d9/longevity-tls-50gb-4d-3-1-db-cluster-5979c9a0.zip

job log: scratch.scylladb.com/qa_issues/sct_08_42_22_09_2019.log.gz
monitor: http://52.212.4.91:3000
snapshot: https://snapshot.raintank.io/dashboard/snapshot/rbPvId9gnqV4vX7vsFF4HkGg22c55zbh

slivne commented 5 years ago

@asias can you please have a look

bhalevy commented 5 years ago

@yarongilor I see there is also a core file:

< t:2019-09-18 06:40:22,124 f:cluster.py      l:833  c:sdcm.cluster         p:DEBUG > Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-6 [52.214.107.136 | 10.0.42.134] (seed: False): Found coredump file: /var/lib/systemd/coredump/core.scylla.996.fda02d2cfd394d57ad5130894d4813bc.64535.1568788699000000

Did you open an issue for it too?

yarongilor commented 5 years ago

This coredump was lost due to an SCT issue, but it may be reproduced in: https://github.com/scylladb/scylla/issues/5086

asias commented 5 years ago

The error in the log says: [shard 0] storage_proxy - Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005)

It is from the proxy server. It has nothing to do with repair, because repair won't delete any of the tables.

What operations were performed before and during the RestartThenRepairNode nemesis? Did we remove any tables during the test?

The sequence is:


- Node 10.0.42.134 runs RestartThenRepairNode

- 5 mins later

- Node 10.0.169.31 reports: Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005)

- The node 10.0.42.134 returned to be operational only after ~ 40 minutes

What do you mean by "the node 10.0.42.134 returned to be operational only after ~ 40 minutes."?

asias commented 5 years ago

OK. Looking at the metrics, now I understand what "not operational" means. The node 10.0.42.134, which runs RestartThenRepairNode, serves very few reads/writes and its CPU utilization goes up and down.

asias commented 5 years ago

Node 10-0-42-134 restarts:

2019-09-18T07:05:09+00:00  ip-10-0-42-134 !NOTICE  | kernel: Linux version 5.2.14-1.el7.elrepo.x86_64 (mockbuild@Build64R7) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)) #1 SMP Tue Sep 10 10:50:19 EDT 2019
2019-09-18T07:05:18+00:00  ip-10-0-42-134 !INFO    | scylla: Scylla version 3.1.0.rc7-0.20190915.024d1563a starting ...
2019-09-18T07:05:21+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] storage_service - Starting up server gossip

Node 10-0-42-134 starts to bootstrap, because it replaces a node:

2019-09-18T07:06:04+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] storage_service - JOINING: Replacing a node with token(s): {}

Node 10-0-42-134 still pulls schema:

 2019-09-18T07:07:10+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] migration_manager - Requesting schema pull from 10.0.169.31:0
 2019-09-18T07:07:10+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] migration_manager - Pulling schema from 10.0.169.31:0
 2019-09-18T07:07:10+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] schema_tables - Creating keyspace keyspace1
 2019-09-18T07:07:10+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] schema_tables - Creating keyspace keyspace_deflate
 2019-09-18T07:07:10+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] schema_tables - Creating keyspace keyspace_lz4
 2019-09-18T07:07:10+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] schema_tables - Creating keyspace keyspace_snappy
 2019-09-18T07:07:10+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] schema_tables - Creating keyspace ks_truncate
 2019-09-18T07:07:10+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] schema_tables - Creating keyspace mview
 2019-09-18T07:07:10+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] schema_tables - Creating keyspace system_auth
 2019-09-18T07:07:10+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] schema_tables - Creating keyspace system_distributed
 2019-09-18T07:07:10+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] schema_tables - Creating keyspace system_traces

Node 10-0-42-134 finishes the replace operation:

2019-09-18T07:41:07+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] storage_service - Bootstrap completed! for the tokens {}
2019-09-18T07:41:19+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] storage_service - Starting listening for CQL clients on 0.0.0.0:9042 (unencrypted)

Node 10-0-42-134 starts repair:

2019-09-18T07:43:35+00:00  ip-10-0-42-134 !INFO    | scylla: [shard 0] repair - starting user-requested repair for keyspace system_traces, repair id 1, options {{ trace -> false}, { jobThreads -> 1}, { incremental -> false}, { primaryRange -> false}, { parallelism -> parallel}}

Node 10-0-169-31 saw the error:

2019-09-18T07:07:10+00:00  ip-10-0-169-31 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005)

Node 10-0-169-31 saw that node 10.0.42.134 was down:

2019-09-18T07:02:35+00:00  ip-10-0-169-31 !INFO    | scylla: [shard 0] gossip - InetAddress 10.0.42.134 is now DOWN, status = shutdown
2019-09-18T07:05:22+00:00  ip-10-0-169-31 !INFO    | scylla: [shard 0] gossip - InetAddress 10.0.42.134 is now DOWN, status = hibernate

2019-09-18T07:07:10+00:00  ip-10-0-169-31 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.42.134: std::runtime_error (Can't find a column family with UUID f6c3c240-d892-11e9-b106-000000000005)

2019-09-18T07:07:10+00:00  ip-10-0-169-31 !INFO    | scylla: [shard 0] gossip - InetAddress 10.0.42.134 is now DOWN, status = hibernate
2019-09-18T07:41:08+00:00  ip-10-0-169-31 !INFO    | scylla: [shard 0] storage_service - Node 10.0.42.134 state jump to normal

According to the log, node 10-0-42-134 is replacing itself.

$ cat sct_08_42_22_09_2019.log |grep replace_address|grep 42.134
< t:2019-09-18 07:02:27,420 f:remote.py       l:213  c:sdcm.remote          p:DEBUG > RemoteCmdRunner [centos@10.0.42.134]: Running command "sudo grep replace_address: /etc/scylla/scylla.yaml"...
< t:2019-09-18 07:02:28,972 f:remote.py       l:108  c:sdcm.remote          p:INFO  > RemoteCmdRunner [centos@10.0.42.134]: Command "sudo grep replace_address: /etc/scylla/scylla.yaml" finished with status 1
< t:2019-09-18 07:02:28,992 f:remote.py       l:213  c:sdcm.remote          p:DEBUG > RemoteCmdRunner [centos@10.0.42.134]: Running command "sudo echo replace_address_first_boot: 10.0.42.134 >> /etc/scylla/scylla.yaml"...
< t:2019-09-18 07:02:31,971 f:remote.py       l:108  c:sdcm.remote          p:INFO  > RemoteCmdRunner [centos@10.0.42.134]: Command "sudo echo replace_address_first_boot: 10.0.42.134 >> /etc/scylla/scylla.yaml" finished with status 0

This explains why node 10-0-42-134 serves zero CQL requests from 07:00 to 07:40: because it is replacing a node during that window, it does not serve CQL reads/writes until the replace operation is done. It also explains why it hits the column family issue: at 07:07:10 the node is still booting up and learning the schema from the cluster. But I cannot explain why node 10-0-169-31 sent a storage_proxy request to node 10-0-42-134 in the first place. The replacing node 10-0-42-134 sets its gossip status to HIBERNATE, and 10-0-169-31 will not mark the replacing node as UP while it is in HIBERNATE status. Perhaps storage_proxy sends requests to 10-0-42-134 even if it is DOWN?
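
To illustrate the "Can't find a column family" part, here is a minimal, self-contained C++ sketch (an assumption for illustration, not the actual ScyllaDB source; the database/find_column_family names are hypothetical stand-ins): a replica that has not finished pulling the schema has no entry for the requested table UUID, so the lookup throws exactly this runtime_error, which the coordinator then logs as "Exception when communicating with ...".

#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

struct column_family {};  // stand-in for the real table object

struct database {
    // UUID -> table; empty (or partial) while the node is still replacing/bootstrapping
    std::map<std::string, column_family> tables_by_uuid;

    column_family& find_column_family(const std::string& uuid) {
        auto it = tables_by_uuid.find(uuid);
        if (it == tables_by_uuid.end()) {
            throw std::runtime_error("Can't find a column family with UUID " + uuid);
        }
        return it->second;
    }
};

int main() {
    database db;  // schema not yet pulled from the cluster
    try {
        db.find_column_family("f6c3c240-d892-11e9-b106-000000000005");
    } catch (const std::runtime_error& e) {
        std::cerr << e.what() << '\n';  // the message seen in the storage_proxy error
    }
}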

slivne commented 5 years ago

@roydahan please verify this is not related to the restart node issue

yarongilor commented 5 years ago

The exception happened after the 'scylla_create_devices' execution, right during the scylla-server start. The repair started only 36 minutes later.

$ grep 'Got new public IP\|Waiting for SSH to be up\|Waiting for DB services to be\|netstat -ln' sct.log | grep 't:2019-09-18 07:0' &
[jenkins@aws-eu-builder3 latest]$ < t:2019-09-18 07:05:07,524 f:cluster_aws.py  l:430  c:sdcm.cluster         p:DEBUG > Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-6 [52.214.107.136 | 10.0.42.134] (seed: False): Got new public IP 63.32.44.228
< t:2019-09-18 07:07:11,016 f:remote.py       l:112  c:sdcm.remote          p:ERROR > RemoteCmdRunner [centos@10.0.42.134]: Error executing command: "netstat -ln | grep :9042"; Exit status: 1
< t:2019-09-18 07:07:11,927 f:wait.py         l:48   c:sdcm.wait            p:DEBUG > wait_for: Retrying Node longevity-tls-50gb-4d-3-1-db-node-5979c9a0-6 [63.32.44.228 | 10.0.42.134] (seed: False): Waiting for DB services to be up: attempt 1 ended with: False
amoskong commented 5 years ago

This issue occurred in https://jenkins.scylladb.com/job/scylla-3.1/job/longevity/job/longevity-100gb-4h/95/

10-07 00:06:17,645 | sdcm.nemesis.ChaosMonkey: >>>>>>>>>>>>>Started random_disrupt_method restart_then_repair_node
10-07 00:07:35,191 | RemoteCmdRunner [centos@172.30.0.146]: Running command "sudo /usr/lib/scylla/scylla-ami/scylla_create_devices"...
10-07 00:07:38,407 [centos@172.30.0.146]: Command "sudo /usr/lib/scylla/scylla-ami/scylla_create_devices" finished with status 0

10-07 00:07:47,504 | 2019-10-07T00:07:42+00:00 ip-172-30-0-11 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 172.30.0.146: std::runtime_error (Can't find a column family with UUID 3afdcef0-e884-11e9-96bb-00000000000a)
10-07 00:07:47,517 | 2019-10-07T00:07:42+00:00 ip-172-30-0-11 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 172.30.0.146: std::runtime_error (Can't find a column family with UUID 3afdcef0-e884-11e9-96bb-00000000000a)
10-07 00:07:47,528 | 2019-10-07T00:07:42+00:00 ip-172-30-0-11 !ERR | scylla: [shard 0] storage_proxy - Exception when communicating with 172.30.0.146: std::runtime_error (Can't find a column family with UUID 3afdcef0-e884-11e9-96bb-00000000000a)
10-07 00:10:41,215 | RemoteCmdRunner [centos@172.30.0.146]: Running command "nodetool repair "...
10-07 00:11:49,433 | RemoteCmdRunner [centos@172.30.0.146]: Command "nodetool repair " finished with status 0
10-07 00:11:49,433 | sdcm.nemesis.ChaosMonkey: <<<<<<<<<<<<<Finished random_disrupt_method restart_then_repair_node
slivne commented 5 years ago

Pushing to 3.2 - once fixed, consider backports.

slivne commented 4 years ago

@yarongilor / @roydahan / @amoskong does this happen in 3.2 as well - if not we will close

roydahan commented 4 years ago

I didn't see anyone reporting it; I think we can close.

amoskong commented 4 years ago

@yarongilor / @roydahan / @amoskong does this happen in 3.2 as well - if not we will close

I didn't see this issue with 3.2

juliayakovlev commented 4 years ago

Reproduced in https://jenkins.scylladb.com/job/scylla-3.3/job/longevity/job/longevity-mv-si-4days/6

Scylla version: 3.3.rc1-0.20200223.756574d0946
Get logs: hydra investigate show-logs efda4b93-c29c-4248-a4c8-790923c084bf

The errors occurred after a node restart, during bootstrap, while compaction and streaming ran in parallel. semaphore_timed_out errors were also reported at the same time:

< t:2020-02-28 07:07:25,158 f:cluster.py      l:1368 c:sdcm.cluster         p:DEBUG > 2020-02-28T06:59:11+00:00  longevity-mv-si-4d-3-3-db-node-efda4b93-9 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.181.73: std::runtime_error (Can't find a column family with UUID 3cc8e850-57dd-11ea-9e81-00000000000c)
< t:2020-02-28 07:07:25,169 f:cluster.py      l:1368 c:sdcm.cluster         p:DEBUG > 2020-02-28T06:16:05+00:00  longevity-mv-si-4d-3-3-db-node-efda4b93-10 !WARNING | scylla: [shard 1] storage_proxy - Failed to apply mutation from 10.0.115.61#1: seastar::semaphore_timed_out (Semaphore timedout)
< t:2020-02-28 07:07:25,171 f:cluster.py      l:1368 c:sdcm.cluster         p:DEBUG > 2020-02-28T06:59:11+00:00  longevity-mv-si-4d-3-3-db-node-efda4b93-9 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.181.73: std::runtime_error (Can't find a column family with UUID 3cc8e850-57dd-11ea-9e81-00000000000c)
< t:2020-02-28 07:07:25,178 f:cluster.py      l:1368 c:sdcm.cluster         p:DEBUG > 2020-02-28T06:16:05+00:00  longevity-mv-si-4d-3-3-db-node-efda4b93-10 !WARNING | scylla: [shard 4] storage_proxy - Failed to apply mutation from 10.0.140.201#4: seastar::semaphore_timed_out (Semaphore timedout)

Repair was started later. The "Can't find a column family with UUID" error does not occur during repair.

slivne commented 4 years ago

@asias can you check whether the issue is somehow related to repair running? I think we changed the code to validate the schema when repair starts (I think it's part of 3.3).

slivne commented 4 years ago

We assume this is an issue with stopping/starting an AWS node and mounting the device...

We will create a separate issue for this.

Pushing this out to 3.4 to be verified with the fix

bhalevy commented 4 years ago

#6011 reports the same error during repair-based bootstrap.

roydahan commented 4 years ago

@juliayakovlev please verify if the issue happened BEFORE or AFTER the scylla_create_devices. If it's BEFORE and the second start was ok, it's not an issue.

roydahan commented 4 years ago

Reported https://github.com/scylladb/scylla/issues/6012 to stop having these errors.

juliayakovlev commented 4 years ago

scylla_create_devices was started before and finished successfully:

< t:2020-02-28 06:56:57,794 f:nemesis.py      l:390  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.ChaosMonkey: Set current_disruption -> RestartThenRepairNode Node longevity-mv-si-4d-3-3-db-node-efda4b93-10 [34.243.145.189 | 10.0.181.73] (seed: False)
< t:2020-02-28 06:59:02,209 f:remote.py       l:286  c:sdcm.remote          p:DEBUG > RemoteCmdRunner [centos@10.0.181.73]: Running command "sudo test -e /usr/lib/scylla/scylla-ami/scylla_create_devices"...
< t:2020-02-28 06:59:02,667 f:remote.py       l:286  c:sdcm.remote          p:DEBUG > RemoteCmdRunner [centos@10.0.181.73]: Running command "sudo test -e /opt/scylladb/scylla-ami/scylla_create_devices"...
< t:2020-02-28 06:59:03,585 f:remote.py       l:152  c:sdcm.remote          p:INFO  > RemoteCmdRunner [centos@10.0.181.73]: Command "sudo test -e /opt/scylladb/scylla-ami/scylla_create_devices" finished with status 0
< t:2020-02-28 06:59:03,589 f:remote.py       l:286  c:sdcm.remote          p:DEBUG > RemoteCmdRunner [centos@10.0.181.73]: Running command "sudo /opt/scylladb/scylla-ami/scylla_create_devices"...
< t:2020-02-28 06:59:09,076 f:remote.py       l:152  c:sdcm.remote          p:INFO  > RemoteCmdRunner [centos@10.0.181.73]: Command "sudo /opt/scylladb/scylla-ami/scylla_create_devices" finished with status 0
< t:2020-02-28 07:07:25,158 f:cluster.py      l:1368 c:sdcm.cluster         p:DEBUG > 2020-02-28T06:59:11+00:00  longevity-mv-si-4d-3-3-db-node-efda4b93-9 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.181.73: std::runtime_error (Can't find a column family with UUID 3cc8e850-57dd-11ea-9e81-00000000000c)
< t:2020-02-28 07:07:25,171 f:cluster.py      l:1368 c:sdcm.cluster         p:DEBUG > 2020-02-28T06:59:11+00:00  longevity-mv-si-4d-3-3-db-node-efda4b93-9 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.181.73: std::runtime_error (Can't find a column family with UUID 3cc8e850-57dd-11ea-9e81-00000000000c)
< t:2020-02-28 07:07:25,180 f:cluster.py      l:1368 c:sdcm.cluster         p:DEBUG > 2020-02-28T06:59:11+00:00  longevity-mv-si-4d-3-3-db-node-efda4b93-9 !ERR     | scylla: [shard 7] storage_proxy - Exception when communicating with 10.0.181.73: std::runtime_error (Can't find a column family with UUID 3cc8e850-57dd-11ea-9e81-00000000000c)
slivne commented 4 years ago

@roydahan / @juliayakovlev are we still hitting this issue in master / 4.0 - if not please close

roydahan commented 4 years ago

@fgelcer please report if you see this on one of the longevity tests of 4.0. Please note it will run only on longevity that isn’t using spot instances.

fgelcer commented 4 years ago

@fgelcer please report if you see this on one of the longevity tests of 4.0. Please note it will run only on longevity that isn’t using spot instances.

I haven't seen this issue on 4.0... I will grep some logs now to check whether this message happens, but I'm confident it doesn't.

fgelcer commented 3 years ago

Scylla version 4.5.dev-0.20210118.faf71c6f7 with build-id 05403082ab7eef6e982494d451f9620bf8293f58, test_id = 6445e5cb-3357-43ff-9389-9bdb729d9f64. I'm experiencing something very similar to this issue, during the RestartThenRepairNode nemesis:

[2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,250 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T10:02:36+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,254 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T10:02:36+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,255 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T10:02:36+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,256 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T10:02:36+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T10:02:51.453Z] < t:2021-01-22 10:02:49,258 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T10:02:36+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-4 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.0.133, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T16:49:19.098Z] < t:2021-01-22 16:49:15,832 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T16:49:03+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-8 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T16:49:19.098Z] < t:2021-01-22 16:49:15,836 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T16:49:13+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-8 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T16:50:01.925Z] < t:2021-01-22 16:50:00,838 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T16:49:54+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-8 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: seastar::named_semaphore_timed_out (Semaphore timed out: _read_concurrency_sem)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,816 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,819 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,824 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,826 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_zstd.standard1: std::runtime_error (Can't find a column family with UUID 9fb99d90-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,827 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: std::runtime_error (Can't find a column family with UUID 7637da40-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.619Z] < t:2021-01-22 17:06:28,829 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_zstd.standard1: std::runtime_error (Can't find a column family with UUID 9fb99d90-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,830 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace1.standard1: std::runtime_error (Can't find a column family with UUID 7637da40-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,832 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,834 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_zstd.standard1: std::runtime_error (Can't find a column family with UUID 9fb99d90-5c6b-11eb-b111-000000000008)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,836 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,837 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from mview.users_by_last_name: std::runtime_error (Can't find a column family with UUID 2a4eee70-5c7f-11eb-99a4-000000000003)
[2021-01-22T17:06:29.620Z] < t:2021-01-22 17:06:28,839 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-22T17:06:27+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-1 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.3.245, to read from keyspace_snappy.standard1: std::runtime_error (Can't find a column family with UUID b859efd0-5c6b-11eb-b111-000000000008)

In this case it is also happening on an MV, but with the same description as originally reported. In total we had 81K ERROR events in this test, so I manually retrieved part of the logs.

Also, in another run of the same nemesis (RestartThenRepairNode) we had backtraces too:

[2021-01-23T12:37:22.291Z] < t:2021-01-23 12:37:21,706 f:file_logger.py  l:78   c:sdcm.sct_events.file_logger p:INFO  > 2021-01-23T12:27:03+00:00  longevity-tls-50gb-4d-master-db-node-6445e5cb-3 !WARNING | scylla: [shard 5] seastar - Exceptional future ignored: exceptions::mutation_write_timeout_exception (Operation timed out for mview.users_by_last_name - received only 0 responses from 1 CL=ANY.), backtrace:    0x3c73dae#012   0x3c74220#012   0x3c745a8#012   0x389f627#012   0x2492cbc#012   0x38ea26f#012   0x38eb457#012   0x3909b18#012   0x38b5bba#012   /opt/scylladb/libreloc/libpthread.so.0+0x93f8#012   /opt/scylladb/libreloc/libc.so.6+0x101902#012   --------#012   seastar::continuation<seastar::internal::promise_base_with_type<void>, db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14>(db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, db::view::mutate_MV(dht::token const&, std::vector<frozen_mutation_and_schema, std::allocator<frozen_mutation_and_schema> >, db::view::stats&, cf_stats&, tracing::trace_state_ptr, seastar::semaphore_units<seastar::default_timeout_exception_factory, seastar::lowres_clock>, seastar::bool_class<service::allow_hints_tag>, seastar::bool_class<db::view::wait_for_all_updates_tag>)::$_14&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>
--

I succeeded in collecting part of the logs, and they are here.

roydahan commented 3 years ago

@fgelcer please check whether it happened before or after the "scylla_create_devices", and if after, how long after. (Need to make sure it's not when scylla tries to start while it's missing the disks.)

fgelcer commented 3 years ago

@fgelcer please check whether it happened before or after the "scylla_create_devices", and if after, how long after. (Need to make sure it's not when scylla tries to start while it's missing the disks.)

@roydahan, I can't say for sure now, as this is the run whose logs were lost because of ENOSPC on the runner, but I believe it happened after the devices were re-created.

juliayakovlev commented 3 years ago

A similar issue was seen in scylla-enterprise, branch-2020.1.9.

During the RestartThenRepairNode nemesis, the node longevity-100gb-4h-2020-1-db-node-17f31ab9-4 [34.247.69.27 | 10.0.36.116] was restarted (stopped and started):

< t:2021-06-16 11:44:26,029 f:sct_events.py   l:906  c:sdcm.sct_events      p:INFO  > 2021-06-16 11:44:26.018: (DisruptionEvent Severity.NORMAL): type=start name=RestartThenRepairNode node=Node longevity-100gb-4h-2020-1-db-node-17f31ab9-4 [34.247.69.27 | 10.0.36.116] (seed: False) duration=None

and while node 4 was starting, another node (longevity-100gb-4h-2020-1-db-node-17f31ab9-3) reported two errors:

< t:2021-06-16 11:46:14,169 f:cluster.py      l:1547 c:sdcm.cluster         p:DEBUG > 2021-06-16T11:46:00+00:00  longevity-100gb-4h-2020-1-db-node-17f31ab9-3 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.36.116: std::runtime_error (Can't find a column family with UUID ce510530-ce94-11eb-bd35-000000000005)
< t:2021-06-16 11:46:14,183 f:sct_events.py   l:906  c:sdcm.sct_events      p:INFO  > 2021-06-16T11:46:00+00:00  longevity-100gb-4h-2020-1-db-node-17f31ab9-3 !ERR     | scylla: [shard 0] storage_proxy - Exception when communicating with 10.0.36.116: std::runtime_error (Can't find a column family with UUID ce510530-ce94-11eb-bd35-000000000005)

From node 3:

Jun 16 11:45:50 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]:  [shard 0] rpc - client 10.0.36.116:7000: fail to connect: Connection refused
Jun 16 11:45:50 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]:  [shard 0] rpc - client 10.0.36.116:7000: fail to connect: Connection refused
Jun 16 11:45:53 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]:  [shard 0] rpc - client 10.0.36.116:7000: fail to connect: Connection refused
Jun 16 11:45:58 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]:  [shard 0] rpc - client 10.0.36.116:7000: fail to connect: Connection refused
Jun 16 11:46:00 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]:  [shard 0] gossip - InetAddress 10.0.36.116 is now UP, status = hibernate
Jun 16 11:46:00 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]:  [shard 0] features - range_tombstones: 1 xxhash: 1 in_memory: 1 result: 1
Jun 16 11:46:00 longevity-100gb-4h-2020-1-db-node-17f31ab9-3 scylla[1522]:  [shard 0] storage_proxy - Exception when communicating with 10.0.36.116: std::runtime_error (Can't find a column family with UUID ce510530-ce94-11eb-bd35-000000000005)
roydahan commented 3 years ago

Because this issue is not being fixed and similar issues keep being opened, I decided to disable this nemesis for now:

https://github.com/scylladb/scylla-cluster-tests/pull/3698

kostja commented 1 year ago

@asias I presume repair builds a list of tables to go over at start, so if some table is dropped while repair is running, you'll get an exception like this one. As such, this event is not a bug; such a table should simply be skipped by repair. After all, new tables may appear while repair is running, and they will not be repaired either. So if a table was dropped, it's OK to skip it as well.
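
A minimal C++ sketch of the skip-on-drop behaviour described above (an assumption for illustration, not the actual ScyllaDB repair code; repair_one_table and live_tables are hypothetical): the table list is snapshotted at repair start, and a table that disappears mid-repair is logged and skipped instead of failing the whole job.

#include <iostream>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the live schema; a dropped table disappears from this set.
std::set<std::string> live_tables = {"keyspace1.standard1", "mview.users"};

void repair_one_table(const std::string& table) {
    if (live_tables.count(table) == 0) {
        throw std::runtime_error("Can't find a column family: " + table);
    }
    // ... the actual per-table repair work would happen here ...
}

void repair_keyspace(const std::vector<std::string>& tables_at_start) {
    for (const auto& table : tables_at_start) {
        try {
            repair_one_table(table);
        } catch (const std::runtime_error& e) {
            // Table dropped (or never existed) mid-repair: skip it, don't abort the job.
            std::cerr << "skipping " << table << ": " << e.what() << '\n';
        }
    }
}

int main() {
    // "mview.users_by_last_name" was dropped after the list was built, so it is skipped.
    repair_keyspace({"keyspace1.standard1", "mview.users", "mview.users_by_last_name"});
}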

@fgelcer does repair stop on this exception? @asias can we silence the exception down?

avikivity commented 1 year ago

Recently some fixes went into this area: d819d98e78e1a3b861a17b7b49abf112b4b7a327 for example

kostja commented 1 year ago

@yarongilor with https://github.com/scylladb/scylladb/commit/d819d98e78e1a3b861a17b7b49abf112b4b7a327 the error should be gone - the exception is silenced. I believe you should re-try the test.

avikivity commented 1 year ago

Actually, that patch fixed a resurfacing of the problem; it was fixed earlier.

yarongilor commented 1 year ago

Tested in: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-100gb-4h-test-restart-then-repair/3/

yarongilor commented 1 year ago

Test passed ok: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-100gb-4h-test-restart-then-repair/4/ test-id: a864dc2a-a293-4212-ba17-b3f2f0fbe4f0

Rerunning in order to cover more of the nemesis occurrences: https://jenkins.scylladb.com/job/scylla-staging/job/yarongilor/job/longevity-100gb-4h-test-restart-then-repair/5/

yarongilor commented 1 year ago

@fgelcer, the issue was not reproduced in the above tests and the nemesis passed OK a few more times.

fgelcer commented 1 year ago

@fgelcer, the issue was not reproduced in the above tests and the nemesis passed OK a few more times.

@yarongilor, please send a PR to re-enable the nemesis on master.