scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

c-s thread fails in `disrupt_no_corrupt_repair` nemesis while preparing a new keyspace #5966

Open ilya-rarov opened 1 year ago

ilya-rarov commented 1 year ago

Issue description

During the disrupt_no_corrupt_repair nemesis, we create keyspaces drop_table_during_repair_ks_X by triggering the corresponding c-s threads. The thread that was supposed to create the keyspace drop_table_during_repair_ks_2 started throwing errors and failed at the very beginning:

===== Using optimized driver!!! =====
Connected to cluster: longevity-tls-50gb-3d-2023-1-db-cluster-1dfc319c, max pending requests per connection null, max connections per host 8
Datatacenter: us-east; Host: /10.12.11.235; Rack: 1c
Datatacenter: us-east; Host: /10.12.11.13; Rack: 1c
Datatacenter: us-east; Host: /10.12.9.147; Rack: 1c
Datatacenter: us-east; Host: /10.12.10.222; Rack: 1c
Datatacenter: us-east; Host: /10.12.10.64; Rack: 1c
Datatacenter: us-east; Host: /10.12.11.209; Rack: 1c
WARN  16:06:56,273 Not using advanced port-based shard awareness with /10.12.11.209:9042 because we're missing port-based shard awareness port on the server
WARN  16:06:56,618 Not using advanced port-based shard awareness with /10.12.10.64:9042 because we're missing port-based shard awareness port on the server
WARN  16:06:56,752 Not using advanced port-based shard awareness with /10.12.11.235:9042 because we're missing port-based shard awareness port on the server
WARN  16:06:56,897 Not using advanced port-based shard awareness with /10.12.10.222:9042 because we're missing port-based shard awareness port on the server
WARN  16:06:56,971 Not using advanced port-based shard awareness with /10.12.11.13:9042 because we're missing port-based shard awareness port on the server
WARN  16:06:57,059 Not using advanced port-based shard awareness with /10.12.9.147:9042 because we're missing port-based shard awareness port on the server
WARN  16:07:07,751 No schema agreement from live replicas after 10 s. The schema may not be up to date on some nodes.
Created keyspaces. Sleeping 6s for propagation.
WARN  16:07:18,005 No schema agreement from live replicas after 10 s. The schema may not be up to date on some nodes.
Sleeping 2s...
Running WRITE with 200 threads for 400000 iteration
type       total ops,    op/s,    pk/s,   row/s,    mean,     med,     .95,     .99,    .999,     max,   time,   stderr, errors,  gc: #,  max ms,  sum ms,  sdv ms,      mb
Failed to connect over JMX; not collecting these stats
WARN  16:07:26,676 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,679 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,680 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,681 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,689 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,685 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,688 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,681 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,683 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,693 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
total,         53787,   10757,   10757,   10757,    10.2,     6.9,    29.3,    59.8,   107.6,   210.4,    5.0,  0.00000,      0,      0,       0,       0,       0,       0
WARN  16:07:26,702 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,704 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,708 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.
WARN  16:07:26,715 Unexpected error while querying ip-10-12-10-64.ec2.internal/10.12.10.64:9042 - [com.datastax.driver.core.exceptions.InvalidQueryException: Keyspace 'drop_table_during_repair_ks_2' does not exist]. Find next host to query.

I checked the node logs to find out exactly when the creation of the keyspace was triggered on each node:

2023-03-15T16:06:57+00:00 longevity-tls-50gb-3d-2023-1-db-node-1dfc319c-1     !INFO | scylla[5476]:  [shard  0] schema_tables - Creating keyspace drop_table_during_repair_ks_2

2023-03-15T16:06:57+00:00 longevity-tls-50gb-3d-2023-1-db-node-1dfc319c-4     !INFO | scylla[924]:  [shard  0] schema_tables - Creating keyspace drop_table_during_repair_ks_2

2023-03-15T16:06:57+00:00 longevity-tls-50gb-3d-2023-1-db-node-1dfc319c-5     !INFO | scylla[788]:  [shard  0] schema_tables - Creating keyspace drop_table_during_repair_ks_2

2023-03-15T16:10:14+00:00 longevity-tls-50gb-3d-2023-1-db-node-1dfc319c-9     !INFO | scylla[26350]:  [shard  0] schema_tables - Creating keyspace drop_table_during_repair_ks_2

2023-03-15T16:06:57+00:00 longevity-tls-50gb-3d-2023-1-db-node-1dfc319c-10     !INFO | scylla[812]:  [shard  0] schema_tables - Creating keyspace drop_table_during_repair_ks_2

2023-03-15T16:06:57+00:00 longevity-tls-50gb-3d-2023-1-db-node-1dfc319c-12     !INFO | scylla[5477]:  [shard  0] schema_tables - Creating keyspace drop_table_during_repair_ks_2

It looks like when the c-s thread queried node-9 (10.12.10.64), the keyspace did not exist there yet. That node started creating the keyspace only at 16:10:14 (c-s had already failed by that time).
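One way to avoid this race would be to poll every node until the keyspace is actually visible before starting the stress load. The helper below is a minimal generic sketch, not SCT's actual implementation: `keyspace_exists` stands in for a real per-node check (e.g. querying `system_schema.keyspaces` on that node), and the timeout values are illustrative.

```python
import time


def wait_for_keyspace_on_all_nodes(nodes, keyspace_exists,
                                   timeout=300, poll_interval=5):
    """Poll until keyspace_exists(node) is True for every node, or raise
    TimeoutError listing the nodes where the keyspace never showed up.

    `keyspace_exists` is a hypothetical callable: in a real implementation
    it would query the node's system_schema.keyspaces table directly.
    """
    deadline = time.monotonic() + timeout
    pending = set(nodes)
    while True:
        # Drop nodes that now report the keyspace as present.
        pending = {node for node in pending if not keyspace_exists(node)}
        if not pending:
            return
        if time.monotonic() > deadline:
            raise TimeoutError(
                f"keyspace not visible on nodes: {sorted(pending)}")
        time.sleep(poll_interval)
```

With such a gate in place, the c-s WRITE phase would only start once every coordinator candidate (including node-9) can resolve the keyspace, instead of relying on a fixed sleep.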

This case might correlate with https://github.com/scylladb/scylla-cluster-tests/issues/5413

How frequently does it reproduce?

I don't know.

Installation details

Kernel Version: 5.15.0-1031-aws

Scylla version (or git commit hash): 2023.1.0~rc2-20230302.726f8a090337 with build-id f8b55007dc790d0c224d5e2d4b422786ffeb918e

Cluster size: 6 nodes (im4gn.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0734389f2aed090ad (aws: us-east-1)

Test: longevity-50gb-3days-arm-test

Test id: 1dfc319c-b33c-4b80-a705-e413ac1f0b1e

Test name: enterprise-2023.1/longevity/longevity-50gb-3days-arm-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 1dfc319c-b33c-4b80-a705-e413ac1f0b1e`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=1dfc319c-b33c-4b80-a705-e413ac1f0b1e)
- Show all stored logs command: `$ hydra investigate show-logs 1dfc319c-b33c-4b80-a705-e413ac1f0b1e`

Logs:

- **db-cluster-1dfc319c.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/db-cluster-1dfc319c.tar.gz
- **email_data-1dfc319c.json.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/email_data-1dfc319c.json.tar.gz
- **output-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/output-1dfc319c.log.tar.gz
- **debug-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/debug-1dfc319c.log.tar.gz
- **events-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/events-1dfc319c.log.tar.gz
- **sct-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/sct-1dfc319c.log.tar.gz
- **normal-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/normal-1dfc319c.log.tar.gz
- **argus-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/argus-1dfc319c.log.tar.gz
- **raw_events-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/raw_events-1dfc319c.log.tar.gz
- **critical-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/critical-1dfc319c.log.tar.gz
- **warning-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/warning-1dfc319c.log.tar.gz
- **summary-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/summary-1dfc319c.log.tar.gz
- **left_processes-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/left_processes-1dfc319c.log.tar.gz
- **error-1dfc319c.log.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/error-1dfc319c.log.tar.gz
- **monitor-set-1dfc319c.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/monitor-set-1dfc319c.tar.gz
- **loader-set-1dfc319c.tar.gz** - https://cloudius-jenkins-test.s3.amazonaws.com/1dfc319c-b33c-4b80-a705-e413ac1f0b1e/20230317_175824/loader-set-1dfc319c.tar.gz

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2023.1/job/longevity/job/longevity-50gb-3days-arm-test/2/)
ilya-rarov commented 1 year ago

From the settings of the failed c-s thread:

Node:
  Nodes: [10.12.11.235, 10.12.11.13, 10.12.11.209, 10.12.10.64, 10.12.9.147, 10.12.10.222]
  Is White List: false
  Datacenter: null
Schema:
  Keyspace: drop_table_during_repair_ks_2
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Replication Strategy Options: {replication_factor=6}
  Table Compression: null
  Table Compaction Strategy: null
  Table Compaction Strategy Options: {}

We have a 6-node cluster and RF=6. I guess, @roydahan, you suggested changing the RF, right?
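If the conclusion is to lower the RF, the keyspace DDL issued in the prepare step could be parameterized on cluster size instead of hardcoding RF=6. This is a hypothetical sketch (the function name and the default cap of 3 are my own choices, not SCT code):

```python
def make_keyspace_cql(name: str, num_nodes: int, max_rf: int = 3) -> str:
    """Build a CREATE KEYSPACE statement whose SimpleStrategy replication
    factor never exceeds the node count, capped at max_rf."""
    rf = min(max_rf, num_nodes)
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {rf}}}"
    )
```

For the 6-node cluster above this would produce RF=3 rather than RF=6, so a write no longer requires every node in the cluster to already know about the keyspace.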

fruch commented 1 year ago

So the https://github.com/scylladb/scylla-cluster-tests/pull/5422 workaround didn't actually wait correctly to confirm the creation of the keyspace?

I think there are a few Scylla issues around slow keyspace propagation; I think we should report it there.

roydahan commented 1 year ago

Seems like #5422 didn't solve the issue, because it happened within the prepare phase (the time between creating a keyspace/table and the time c-s gave up on it).

@ilya-rarov let's try to reproduce this by running the single-nemesis job for it. If it reproduces, we will open a new Scylla issue. If not, we may increase the load in this job until it reproduces.