scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

c-s fails to prepare test table during `disrupt_truncate` nemesis, but the test continues and starts the disruption #8722

Open dimakr opened 2 months ago

dimakr commented 2 months ago

At the beginning of the `disrupt_truncate` nemesis, the test keyspace and table are prepared with this c-s command:

< t:2024-09-14 12:46:50,787 f:stress_thread.py l:325  c:sdcm.stress_thread   p:INFO  > cassandra-stress write no-warmup n=400000 cl=QUORUM -mode native cql3  user=cassandra password=cassandra -schema keyspace=ks_truncate 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -log interval=5 -transport 'truststore=/etc/scylla/ssl_conf/truststore.jks truststore-password=cassandra' -node 10.0.0.5,10.0.0.6,10.0.0.7,10.0.0.8,10.0.0.14 -errors skip-unsupported-columns

The command fails with the error:

WARN  [cluster1-nio-worker-5] 2024-09-14 12:46:55,827 RequestHandler.java:303 - Query '[0 bound values] CREATE KEYSPACE IF NOT EXISTS "ks_truncate" WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'replication_factor' : '3'} AND durable_writes = true;' generated server side warning(s): Tables in this keyspace will be replicated using Tablets and will not support CDC, LWT and counters features. To use CDC, LWT or counters, drop this keyspace and re-create it without tablets by adding AND TABLETS = {'enabled': false} to the CREATE KEYSPACE statement.
WARN  [cluster1-worker-1] 2024-09-14 12:46:56,840 ReplicationStategy.java:204 - Error while computing token map for keyspace ks_truncate with datacenter eastus_nemesis_dc: could not achieve replication factor 3 (found 1 replicas only), check your keyspace replication settings.
WARN  [cluster1-worker-2] 2024-09-14 12:46:57,282 ReplicationStategy.java:204 - Error while computing token map for keyspace ks_truncate with datacenter eastus_nemesis_dc: could not achieve replication factor 3 (found 1 replicas only), check your keyspace replication settings.
java.lang.RuntimeException: Encountered exception creating schema
    at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:105)
    at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpaces(SettingsSchema.java:74)
    at org.apache.cassandra.stress.settings.StressSettings.maybeCreateKeyspaces(StressSettings.java:230)
    at org.apache.cassandra.stress.StressAction.run(StressAction.java:58)
    at org.apache.cassandra.stress.Stress.run(Stress.java:143)
    at org.apache.cassandra.stress.Stress.main(Stress.java:62)
Caused by: com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException: Datacenter eastus_nemesis_dc doesn't have enough token-owning nodes for replication_factor=3
    at com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException.copy(InvalidConfigurationInQueryException.java:38)
    at com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException.copy(InvalidConfigurationInQueryException.java:27)
    at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:310)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
    at org.apache.cassandra.stress.util.JavaDriverClient.execute(JavaDriverClient.java:215)
    at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:94)
        ... 5 more         

Even though the c-s command failed, the nemesis continues and starts the truncate disruption, which fails with:

Command: '/usr/bin/cqlsh --no-color -u cassandra -p \'cassandra\'  --request-timeout=600 --connect-timeout=60 --ssl -e "TRUNCATE ks_truncate.standard1 USING TIMEOUT 600s" 10.0.0.8'
Exit code: 2
Stdout:
Stderr:
Warning: Using a password on the command line interface can be insecure.
Recommendation: use the credentials file to securely provide the password.
<stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="unconfigured table standard1"

Installation details

Cluster size: 4 nodes (Standard_L16s_v3)

Scylla Nodes used in this run:

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/SCYLLA-IMAGES/providers/Microsoft.Compute/images/scylla-6.2.0-dev-x86_64-2024-09-13T02-56-40 (azure: undefined_region)

Test: longevity-1tb-5days-azure-test

Test id: ce64f53c-084b-4445-8b62-784fa80adf1c

Test name: scylla-master/tier1/longevity-1tb-5days-azure-test

Test method: longevity_test.LongevityTest.test_custom_time

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor ce64f53c-084b-4445-8b62-784fa80adf1c`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=ce64f53c-084b-4445-8b62-784fa80adf1c)
- Show all stored logs command: `$ hydra investigate show-logs ce64f53c-084b-4445-8b62-784fa80adf1c`

## Logs:

- **core.scylla-longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-7-2024-09-14_20-04-12.gz** - [https://storage.cloud.google.com/upload.scylladb.com/core.scylla.107.3c9c5d855bb84d8ca7992470146d84ba.5030.1726344060000000./core.scylla.107.3c9c5d855bb84d8ca7992470146d84ba.5030.1726344060000000.zst](https://storage.cloud.google.com/upload.scylladb.com/core.scylla.107.3c9c5d855bb84d8ca7992470146d84ba.5030.1726344060000000./core.scylla.107.3c9c5d855bb84d8ca7992470146d84ba.5030.1726344060000000.zst)
- **db-cluster-ce64f53c.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/db-cluster-ce64f53c.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/db-cluster-ce64f53c.tar.gz)
- **sct-runner-events-ce64f53c.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/sct-runner-events-ce64f53c.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/sct-runner-events-ce64f53c.tar.gz)
- **sct-ce64f53c.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/sct-ce64f53c.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/sct-ce64f53c.log.tar.gz)
- **loader-set-ce64f53c.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/loader-set-ce64f53c.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/loader-set-ce64f53c.tar.gz)
- **monitor-set-ce64f53c.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/monitor-set-ce64f53c.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/monitor-set-ce64f53c.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/tier1/job/longevity-1tb-5days-azure-test/34/)

[Argus](https://argus.scylladb.com/test/1e333df8-a7e8-4171-8ab7-1d7bdea907d5/runs?additionalRuns[]=ce64f53c-084b-4445-8b62-784fa80adf1c)

roydahan commented 2 weeks ago

We should have a quick fix that catches this failure, raises an error and exits the nemesis.
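
A minimal sketch of that idea, not the actual SCT code: the prepare step's result is checked, and an error is raised before any TRUNCATE is issued. `PrepareStressFailed`, `StressResult`, `prepare_then_truncate` and the two callables are hypothetical names used only for illustration; the real stress-thread API in SCT may look different.

```python
from dataclasses import dataclass


class PrepareStressFailed(Exception):
    """Hypothetical error: the prepare step did not create the test table."""


@dataclass
class StressResult:
    """Hypothetical stand-in for whatever the stress runner returns."""
    exit_code: int
    stderr: str = ""


def prepare_then_truncate(run_stress, run_truncate):
    """Run the prepare step and truncate only if it succeeded.

    Both arguments are plain callables (assumptions for this sketch):
    run_stress() returns a StressResult, run_truncate() issues the TRUNCATE.
    """
    result = run_stress()
    if result.exit_code != 0:
        # Stop here so the nemesis never issues TRUNCATE on a missing table.
        raise PrepareStressFailed(
            f"c-s prepare failed (exit code {result.exit_code}): "
            f"{result.stderr.strip()}"
        )
    return run_truncate()
```

With a check like this, the nemesis would report the c-s schema error itself instead of the later `unconfigured table standard1` error from cqlsh.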

fruch commented 1 week ago

the 2nd point was addressed in: https://github.com/scylladb/scylla-cluster-tests/commit/dd07de65e67ea699da1c5dd7e42e9c56f9e595be

timtimb0t commented 12 hours ago

Reproduced in this run:

Packages

Scylla version: 6.3.0~dev-20241122.e2e6f4f441be with build-id 2493a7aae1f855d3df502197f757822b6afc1033

Kernel Version: 6.8.0-1019-aws

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-001a2091244fdbdf3 ami-0f2a8365c9e541aa6 ami-0345a6812dbca92fe (aws: undefined_region)

Test: longevity-multi-dc-rack-aware-zero-token-dc-test

Test id: 6d4393cc-c118-450c-a7d9-76fc5fab9e7f

Test name: scylla-master/tier1/longevity-multi-dc-rack-aware-zero-token-dc-test

Test method: longevity_test.LongevityTest.test_custom_time

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 6d4393cc-c118-450c-a7d9-76fc5fab9e7f`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=6d4393cc-c118-450c-a7d9-76fc5fab9e7f)
- Show all stored logs command: `$ hydra investigate show-logs 6d4393cc-c118-450c-a7d9-76fc5fab9e7f`

## Logs:

- **db-cluster-6d4393cc.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/6d4393cc-c118-450c-a7d9-76fc5fab9e7f/20241123_083913/db-cluster-6d4393cc.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6d4393cc-c118-450c-a7d9-76fc5fab9e7f/20241123_083913/db-cluster-6d4393cc.tar.gz)
- **sct-runner-events-6d4393cc.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/6d4393cc-c118-450c-a7d9-76fc5fab9e7f/20241123_083913/sct-runner-events-6d4393cc.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6d4393cc-c118-450c-a7d9-76fc5fab9e7f/20241123_083913/sct-runner-events-6d4393cc.tar.gz)
- **sct-6d4393cc.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/6d4393cc-c118-450c-a7d9-76fc5fab9e7f/20241123_083913/sct-6d4393cc.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6d4393cc-c118-450c-a7d9-76fc5fab9e7f/20241123_083913/sct-6d4393cc.log.tar.gz)
- **loader-set-6d4393cc.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/6d4393cc-c118-450c-a7d9-76fc5fab9e7f/20241123_083913/loader-set-6d4393cc.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6d4393cc-c118-450c-a7d9-76fc5fab9e7f/20241123_083913/loader-set-6d4393cc.tar.gz)
- **monitor-set-6d4393cc.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/6d4393cc-c118-450c-a7d9-76fc5fab9e7f/20241123_083913/monitor-set-6d4393cc.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/6d4393cc-c118-450c-a7d9-76fc5fab9e7f/20241123_083913/monitor-set-6d4393cc.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/tier1/job/longevity-multi-dc-rack-aware-zero-token-dc-test/6/)

[Argus](https://argus.scylladb.com/test/7593a9ee-224a-49ec-af87-c2cdab1280d1/runs?additionalRuns[]=6d4393cc-c118-450c-a7d9-76fc5fab9e7f)

soyacz commented 10 hours ago

Reproduced in this run:

Packages

Scylla version: 6.3.0~dev-20241122.e2e6f4f441be with build-id 2493a7aae1f855d3df502197f757822b6afc1033

Kernel Version: 6.8.0-1019-aws

Installation details

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-8 (13.60.219.172 | 10.0.1.187) (shards: 2)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-7 (13.40.120.58 | 10.3.2.153) (shards: 2)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-6 (3.8.117.118 | 10.3.0.21) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-5 (52.56.176.71 | 10.3.3.165) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-4 (18.171.61.14 | 10.3.3.77) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-3 (54.246.249.15 | 10.4.3.81) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-2 (34.244.59.175 | 10.4.0.81) (shards: 14)
  • multi-dc-rackaware-with-znode-dc-ma-db-node-6d4393cc-1 (108.129.92.105 | 10.4.0.78) (shards: 14)

OS / Image: ami-001a2091244fdbdf3 ami-0f2a8365c9e541aa6 ami-0345a6812dbca92fe (aws: undefined_region)

Test: longevity-multi-dc-rack-aware-zero-token-dc-test

Test id: 6d4393cc-c118-450c-a7d9-76fc5fab9e7f

Test name: scylla-master/tier1/longevity-multi-dc-rack-aware-zero-token-dc-test

Test method: longevity_test.LongevityTest.test_custom_time

Test config file(s):

This nemesis is not supported with znodes (zero-token nodes); a fix is on the way: https://github.com/scylladb/scylla-cluster-tests/pull/9342
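
For illustration only, and not necessarily what the linked PR does, a pre-check along these lines could skip the nemesis when some datacenter has fewer token-owning nodes than the requested replication factor, which is the condition reported in the c-s error above. `NemesisNotApplicable`, `dcs_lacking_token_owners`, `ensure_truncate_supported` and the datacenter names are hypothetical; how the per-DC counts are collected is left out.

```python
class NemesisNotApplicable(Exception):
    """Hypothetical error used to skip a nemesis the topology cannot support."""


def dcs_lacking_token_owners(token_owners_per_dc, replication_factor):
    """Return, sorted by name, the DCs with fewer token-owning nodes than RF.

    token_owners_per_dc maps a DC name to its number of token-owning nodes;
    zero-token nodes are assumed to be excluded from that count already.
    """
    return sorted(
        dc for dc, owners in token_owners_per_dc.items()
        if owners < replication_factor
    )


def ensure_truncate_supported(token_owners_per_dc, replication_factor=3):
    """Raise before any c-s or TRUNCATE command runs if the RF is unreachable."""
    lacking = dcs_lacking_token_owners(token_owners_per_dc, replication_factor)
    if lacking:
        raise NemesisNotApplicable(
            f"replication_factor={replication_factor} cannot be satisfied in: "
            + ", ".join(lacking)
        )


# Made-up topology: one regular DC and one DC holding only zero-token nodes.
print(dcs_lacking_token_owners({"dc1": 3, "dc_zero_token": 0}, 3))
```

Running the check before the prepare step keeps both the c-s schema failure and the follow-up TRUNCATE error out of the test results for topologies the nemesis was never meant to cover.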