Open fruch opened 11 months ago
Is this with `consistent_cluster_management: true`? If not, concurrent schema changes are not supported by Scylla, and this has been a known issue forever.
The consistency level is ignored by DDL statements; the driver waits for schema agreement instead.
Another problem is that the stress command is not prepared to handle concurrent schema setup properly:
    try
    {
        //Keyspace
        client.execute(createKeyspaceStatementCQL3(), org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE); // [1]
        client.execute("USE \""+keyspace+"\"", org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);
        //Add standard1
        client.execute(createStandard1StatementCQL3(settings), org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE); // [2]
        System.out.println(String.format("Created keyspaces. Sleeping %ss for propagation.", settings.node.nodes.size()));
        Thread.sleep(settings.node.nodes.size() * 1000L); // seconds
    }
    catch (AlreadyExistsException e)
    {
        //Ok.
    }
It catches AlreadyExistsException and assumes it's OK, but if [1] throws it, it doesn't guarantee that [2] was already executed.
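One way to avoid skipping [2] when [1] throws is to guard each DDL statement on its own. The following is only a minimal sketch of that idea: `CqlClient`, `executeDdl`, `createSchema`, and the local `AlreadyExistsException` are hypothetical stand-ins, not the stress tool's or the driver's real API, and the sketch does nothing about schema propagation between nodes.

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaSetupSketch {

    /** Hypothetical stand-in for the driver's AlreadyExistsException. */
    static class AlreadyExistsException extends RuntimeException {
        AlreadyExistsException(String message) {
            super(message);
        }
    }

    /** Hypothetical minimal client; the real stress client has a different API. */
    interface CqlClient {
        void execute(String cql);
    }

    /**
     * Runs a single DDL statement and treats "already exists" as success
     * for that statement only. Returns true if this caller created the object.
     */
    static boolean executeDdl(CqlClient client, String cql) {
        try {
            client.execute(cql);
            return true;
        } catch (AlreadyExistsException e) {
            return false; // someone else created it first
        }
    }

    /** Guards [1] and [2] separately, so a failure at [1] cannot skip [2]. */
    static List<Boolean> createSchema(CqlClient client, String createKeyspace,
                                      String useKeyspace, String createTable) {
        List<Boolean> created = new ArrayList<>();
        created.add(executeDdl(client, createKeyspace)); // [1]
        client.execute(useKeyspace);
        created.add(executeDdl(client, createTable));    // [2]
        return created;
    }

    public static void main(String[] args) {
        List<String> executed = new ArrayList<>();
        // Fake client that simulates another loader having created the keyspace.
        CqlClient fake = cql -> {
            executed.add(cql);
            if (cql.startsWith("CREATE KEYSPACE")) {
                throw new AlreadyExistsException("keyspace already exists");
            }
        };
        createSchema(fake, "CREATE KEYSPACE ks1 ...", "USE ks1", "CREATE TABLE standard1 ...");
        System.out.println(executed);
    }
}
```

With the fake client above, the table statement is still attempted even though the keyspace statement reported that the keyspace already exists.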
It's enabled:
consistent_cluster_management: true
What's the scylla-tools version?
Version: 2022.2.dev-0.20220330.eef4cbb20a51
Probably should sync it to the scylla-tools we release (say, in latest 2022.2.x). We don't change it that often though (we just did, due to 3rd party dependency, but we also IIRC bumped up the driver version)
I think that it happens here:
client.execute("USE \""+keyspace+"\"", org.apache.cassandra.db.ConsistencyLevel.LOCAL_ONE);
This boils down to this core issue: https://github.com/scylladb/scylladb/issues/16909
and the fix for it is in: https://github.com/scylladb/scylladb/pull/16969
Issue description
When multiple machines run the same command, we can end up in a situation where some of the commands fail like this:
Since `LOCAL_ONE` is used in those functions, the second command may run against a node that has not yet received the schema change. All of these should use QUORUM, or at least the same consistency level the user requested in the command.
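A possible client-side mitigation for this race is to poll until the contacted node reports the new schema before issuing the next statement, instead of relying on a fixed sleep. This is a rough sketch under assumptions: `waitUntilVisible` and the `schemaVisible` check are hypothetical names, and how the check is implemented (for example, by querying the node's schema tables) is left open here.

```java
import java.util.function.BooleanSupplier;

public class SchemaWaitSketch {

    /**
     * Polls a caller-supplied (hypothetical) visibility check until it returns
     * true or the attempts run out. Returns true if the schema became visible.
     */
    static boolean waitUntilVisible(BooleanSupplier schemaVisible, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (schemaVisible.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and give up waiting.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulated check: the schema becomes visible on the third poll.
        int[] calls = {0};
        boolean visible = waitUntilVisible(() -> ++calls[0] >= 3, 5, 10L);
        System.out.println("visible=" + visible + " after " + calls[0] + " checks");
    }
}
```

A bounded number of attempts keeps a loader from hanging indefinitely if the schema never arrives; the caller can then fail with a clear error instead of a confusing one from the next statement.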
Impact
Because of this, jobs fail more or less randomly on one of the stress commands.
How frequently does it reproduce?
Happens once in a while, not easily reproduced
Installation details
Kernel Version: 5.15.0-1050-aws
Scylla version (or git commit hash): `2023.3.0~dev-20231123.45a20d3f1b34` with build-id `d8a9c10e69a4a724c9eb2ac44ce1ded995fbea6e`
Cluster size: 3 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image: `ami-096202c204a4676eb` (aws: undefined_region)
Test: `scylla-enterprise-perf-regression-latency-650gb-with-nemesis`
Test id: `a1a475f2-e20f-49af-86c5-1f2655194683`
Test name: `scylla-enterprise/scylla-enterprise-perf-regression-latency-650gb-with-nemesis`
Test config file(s):
Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor a1a475f2-e20f-49af-86c5-1f2655194683`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=a1a475f2-e20f-49af-86c5-1f2655194683)
- Show all stored logs command: `$ hydra investigate show-logs a1a475f2-e20f-49af-86c5-1f2655194683`

## Logs:

- **db-cluster-a1a475f2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a1a475f2-e20f-49af-86c5-1f2655194683/20231126_044742/db-cluster-a1a475f2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a1a475f2-e20f-49af-86c5-1f2655194683/20231126_044742/db-cluster-a1a475f2.tar.gz)
- **sct-runner-a1a475f2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a1a475f2-e20f-49af-86c5-1f2655194683/20231126_044742/sct-runner-a1a475f2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a1a475f2-e20f-49af-86c5-1f2655194683/20231126_044742/sct-runner-a1a475f2.tar.gz)
- **monitor-set-a1a475f2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a1a475f2-e20f-49af-86c5-1f2655194683/20231126_044742/monitor-set-a1a475f2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a1a475f2-e20f-49af-86c5-1f2655194683/20231126_044742/monitor-set-a1a475f2.tar.gz)
- **loader-set-a1a475f2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a1a475f2-e20f-49af-86c5-1f2655194683/20231126_044742/loader-set-a1a475f2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a1a475f2-e20f-49af-86c5-1f2655194683/20231126_044742/loader-set-a1a475f2.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-enterprise/job/scylla-enterprise-perf-regression-latency-650gb-with-nemesis/51/)
[Argus](https://argus.scylladb.com/test/6e1c3a8f-8efc-45a7-b1d4-fbebc5be7c05/runs?additionalRuns[]=a1a475f2-e20f-49af-86c5-1f2655194683)