scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

c-s fails to prepare test table during `disrupt_truncate` nemesis, but the test continues and starts the disruption #8722

Open dimakr opened 1 month ago

dimakr commented 1 month ago

At the beginning of the disrupt_truncate nemesis, the test keyspace and table are prepared with the c-s command:

< t:2024-09-14 12:46:50,787 f:stress_thread.py l:325  c:sdcm.stress_thread   p:INFO  > cassandra-stress write no-warmup n=400000 cl=QUORUM -mode native cql3  user=cassandra password=cassandra -schema keyspace=ks_truncate 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -log interval=5 -transport 'truststore=/etc/scylla/ssl_conf/truststore.jks truststore-password=cassandra' -node 10.0.0.5,10.0.0.6,10.0.0.7,10.0.0.8,10.0.0.14 -errors skip-unsupported-columns

The command fails with the error:

WARN  [cluster1-nio-worker-5] 2024-09-14 12:46:55,827 RequestHandler.java:303 - Query '[0 bound values] CREATE KEYSPACE IF NOT EXISTS "ks_truncate" WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'replication_factor' : '3'} AND durable_writes = true;' generated server side warning(s): Tables in this keyspace will be replicated using Tablets and will not support CDC, LWT and counters features. To use CDC, LWT or counters, drop this keyspace and re-create it without tablets by adding AND TABLETS = {'enabled': false} to the CREATE KEYSPACE statement.
WARN  [cluster1-worker-1] 2024-09-14 12:46:56,840 ReplicationStategy.java:204 - Error while computing token map for keyspace ks_truncate with datacenter eastus_nemesis_dc: could not achieve replication factor 3 (found 1 replicas only), check your keyspace replication settings.
WARN  [cluster1-worker-2] 2024-09-14 12:46:57,282 ReplicationStategy.java:204 - Error while computing token map for keyspace ks_truncate with datacenter eastus_nemesis_dc: could not achieve replication factor 3 (found 1 replicas only), check your keyspace replication settings.
java.lang.RuntimeException: Encountered exception creating schema
    at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:105)
    at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpaces(SettingsSchema.java:74)
    at org.apache.cassandra.stress.settings.StressSettings.maybeCreateKeyspaces(StressSettings.java:230)
    at org.apache.cassandra.stress.StressAction.run(StressAction.java:58)
    at org.apache.cassandra.stress.Stress.run(Stress.java:143)
    at org.apache.cassandra.stress.Stress.main(Stress.java:62)
Caused by: com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException: Datacenter eastus_nemesis_dc doesn't have enough token-owning nodes for replication_factor=3
    at com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException.copy(InvalidConfigurationInQueryException.java:38)
    at com.datastax.driver.core.exceptions.InvalidConfigurationInQueryException.copy(InvalidConfigurationInQueryException.java:27)
    at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:35)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:310)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:58)
    at org.apache.cassandra.stress.util.JavaDriverClient.execute(JavaDriverClient.java:215)
    at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:94)
        ... 5 more         
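
The driver warnings above point at the likely cause: `replication_factor=3` under NetworkTopologyStrategy is applied to every datacenter, and `eastus_nemesis_dc` reports only 1 replica. A minimal sketch of that check, assuming a per-datacenter count of token-owning nodes is available. The function name, the `eastus` datacenter name, and its node count are hypothetical and used only for illustration; only `eastus_nemesis_dc` with 1 node and the factor 3 come from the log.

```python
def datacenters_too_small(nodes_per_dc: dict[str, int], replication_factor: int) -> list[str]:
    """Return the datacenters that cannot hold `replication_factor` replicas."""
    return sorted(
        dc for dc, node_count in nodes_per_dc.items() if node_count < replication_factor
    )


# Hypothetical topology: "eastus" and its count are assumed,
# "eastus_nemesis_dc" with 1 token-owning node is taken from the warning.
topology = {"eastus": 4, "eastus_nemesis_dc": 1}
print(datacenters_too_small(topology, 3))  # prints ['eastus_nemesis_dc']
```

With such a topology, CREATE KEYSPACE is rejected on a tablets-enabled cluster, which matches the InvalidConfigurationInQueryException in the trace.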

Even though the c-s command failed, the nemesis continues and starts the truncate disruption, which fails with:

Command: '/usr/bin/cqlsh --no-color -u cassandra -p \'cassandra\'  --request-timeout=600 --connect-timeout=60 --ssl -e "TRUNCATE ks_truncate.standard1 USING TIMEOUT 600s" 10.0.0.8'
Exit code: 2
Stdout:
Stderr:
Warning: Using a password on the command line interface can be insecure.
Recommendation: use the credentials file to securely provide the password.
<stdin>:1:InvalidRequest: Error from server: code=2200 [Invalid query] message="unconfigured table standard1"

Installation details

Cluster size: 4 nodes (Standard_L16s_v3)

Scylla Nodes used in this run:

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/SCYLLA-IMAGES/providers/Microsoft.Compute/images/scylla-6.2.0-dev-x86_64-2024-09-13T02-56-40 (azure: undefined_region)

Test: longevity-1tb-5days-azure-test
Test id: ce64f53c-084b-4445-8b62-784fa80adf1c
Test name: scylla-master/tier1/longevity-1tb-5days-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor ce64f53c-084b-4445-8b62-784fa80adf1c`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=ce64f53c-084b-4445-8b62-784fa80adf1c)
- Show all stored logs command: `$ hydra investigate show-logs ce64f53c-084b-4445-8b62-784fa80adf1c`

## Logs:

- **core.scylla-longevity-tls-1tb-7d-master-db-node-ce64f53c-eastus-7-2024-09-14_20-04-12.gz** - [https://storage.cloud.google.com/upload.scylladb.com/core.scylla.107.3c9c5d855bb84d8ca7992470146d84ba.5030.1726344060000000./core.scylla.107.3c9c5d855bb84d8ca7992470146d84ba.5030.1726344060000000.zst](https://storage.cloud.google.com/upload.scylladb.com/core.scylla.107.3c9c5d855bb84d8ca7992470146d84ba.5030.1726344060000000./core.scylla.107.3c9c5d855bb84d8ca7992470146d84ba.5030.1726344060000000.zst)
- **db-cluster-ce64f53c.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/db-cluster-ce64f53c.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/db-cluster-ce64f53c.tar.gz)
- **sct-runner-events-ce64f53c.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/sct-runner-events-ce64f53c.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/sct-runner-events-ce64f53c.tar.gz)
- **sct-ce64f53c.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/sct-ce64f53c.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/sct-ce64f53c.log.tar.gz)
- **loader-set-ce64f53c.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/loader-set-ce64f53c.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/loader-set-ce64f53c.tar.gz)
- **monitor-set-ce64f53c.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/monitor-set-ce64f53c.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ce64f53c-084b-4445-8b62-784fa80adf1c/20240915_045134/monitor-set-ce64f53c.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/tier1/job/longevity-1tb-5days-azure-test/34/)
[Argus](https://argus.scylladb.com/test/1e333df8-a7e8-4171-8ab7-1d7bdea907d5/runs?additionalRuns[]=ce64f53c-084b-4445-8b62-784fa80adf1c)

roydahan commented 3 days ago

We should have a quick fix that catches the c-s failure, raises an error, and exits the nemesis.
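
A rough sketch of the "catch it, raise an error and exit" idea, not the actual SCT code. `NemesisPreparationError`, `prepare_then_truncate`, and the two callables are hypothetical stand-ins for whatever the nemesis uses to run c-s and cqlsh; the failure markers are substrings copied from the c-s output quoted in this issue.

```python
from typing import Callable


class NemesisPreparationError(RuntimeError):
    """Hypothetical error: the prepare step failed, so the disruption must not run."""


# Substrings taken from the failed c-s output in this issue.
FAILURE_MARKERS = (
    "Encountered exception creating schema",
    "InvalidConfigurationInQueryException",
)


def prepare_then_truncate(run_stress: Callable[[], str], run_truncate: Callable[[], None]) -> None:
    """Run the prepare write and truncate only if no failure marker is in its output."""
    output = run_stress()
    found = [marker for marker in FAILURE_MARKERS if marker in output]
    if found:
        raise NemesisPreparationError(
            f"c-s failed to prepare ks_truncate.standard1, markers found: {found}"
        )
    run_truncate()


# Usage with stand-in callables: the failed prepare step aborts before TRUNCATE.
truncate_calls = []
try:
    prepare_then_truncate(
        lambda: "java.lang.RuntimeException: Encountered exception creating schema",
        lambda: truncate_calls.append("TRUNCATE ks_truncate.standard1"),
    )
except NemesisPreparationError as exc:
    print("nemesis aborted:", exc)
```

Scanning output for markers is only one option; checking the c-s exit status or verifying that `ks_truncate.standard1` exists before truncating would serve the same purpose.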

fruch commented 1 day ago

the 2nd point was addressed in: https://github.com/scylladb/scylla-cluster-tests/commit/dd07de65e67ea699da1c5dd7e42e9c56f9e595be