Open aleksbykov opened 2 months ago
CREATE KEYSPACE IF NOT EXISTS "keyspace1" WITH replication = {'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'replication_factor' : '3'}
Isn't replication factor auto-expansion supposed to be rejected in tablets mode? cc @ptrsmrn @bhalevy
Only in ALTER KEYSPACE. It's fine in CREATE
We should reject the attempts to assign more RF than there are nodes in a given DC, as described in https://github.com/scylladb/scylladb/issues/20356. (Whether through auto-expansion or not.)
Then the initial CREATE here would fail. And the test would have to be adjusted. Which BTW you can do right away @aleksbykov so it doesn't block further testing of zero-token nodes
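To make the proposed check concrete, here is a minimal sketch of the validation idea, not the actual Scylla implementation. The function name `find_unsatisfiable_dcs` and the `token_nodes_per_dc` mapping are hypothetical. Two things are assumptions: that only token-owning nodes count towards the limit, and that `replication_factor=3` expands to every DC. The node counts come from the issue description.

```python
def find_unsatisfiable_dcs(replication, token_nodes_per_dc):
    """Return {dc: (rf, available)} for each DC whose RF exceeds its node count.

    Hypothetical helper: `replication` maps DC name -> requested RF,
    `token_nodes_per_dc` maps DC name -> number of token-owning nodes.
    """
    problems = {}
    for dc, rf in replication.items():
        available = token_nodes_per_dc.get(dc, 0)
        if rf > available:
            problems[dc] = (rf, available)
    return problems


# Topology from the issue description (zero-token nodes are not counted,
# which is an assumption of this sketch).
token_nodes_per_dc = {
    "eu-westscylla_node_west": 3,
    "eu-west-2scylla_node_west": 3,
    "eu-northscylla_node_north": 0,
}

# Assumed result of auto-expanding 'replication_factor': 3 to every DC.
replication = {dc: 3 for dc in token_nodes_per_dc}

problems = find_unsatisfiable_dcs(replication, token_nodes_per_dc)
for dc, (rf, available) in problems.items():
    print(f"reject: RF {rf} in {dc} exceeds {available} available nodes")
```

With a check like this in place, the CREATE from the test would be rejected because of eu-northscylla_node_north.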
Still it's interesting why:
I suspect that this is somehow related to the tablets-specific routing logic in the driver. The vnodes logic somehow handles the situation (perhaps ignoring the DC and routing to other DCs?) while the tablets logic doesn't.
cc @fruch @dkropachev @sylwiaszunejko
But I'm not sure.
I also suspect that the error doesn't require zero-token nodes
Error while computing token map for keyspace keyspace1 with datacenter eu-northscylla_node_north: could not achieve replication factor 3 (found 0 replicas only), check your keyspace replication settings
it just requires the RF to be greater than the number of nodes in this DC.
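To illustrate the suspicion, here is a simplified sketch of a ring walk for one DC. It is not the actual driver code: it ignores racks, and the names `replicas_for_token`, `ring` and `node_dc` are hypothetical. It only shows why a DC with no token-owning nodes yields 0 replicas regardless of the requested RF.

```python
from bisect import bisect_left


def replicas_for_token(ring, node_dc, token, dc, rf):
    """Walk the ring clockwise from `token`, collecting distinct nodes in `dc`.

    `ring` is a sorted list of (token, node) pairs; `node_dc` maps node -> DC.
    Returns at most `rf` nodes; fewer means the RF cannot be achieved.
    """
    if not ring:
        return []
    # First ring entry whose token is >= the requested token.
    start = bisect_left(ring, (token,))
    found = []
    for i in range(len(ring)):
        _, node = ring[(start + i) % len(ring)]
        if node_dc[node] == dc and node not in found:
            found.append(node)
            if len(found) == rf:
                break
    return found


# Three token-owning nodes in one DC; the zero-token node in the other DC
# owns no tokens, so it never appears in the ring.
ring = [(0, "w1"), (100, "w2"), (200, "w3")]
node_dc = {"w1": "dc_west", "w2": "dc_west", "w3": "dc_west", "n1": "dc_north"}

west = replicas_for_token(ring, node_dc, 50, "dc_west", 3)
north = replicas_for_token(ring, node_dc, 50, "dc_north", 3)
print(len(west), len(north))
```

Under these assumptions the walk for `dc_north` finds no replicas, which matches the "found 0 replicas only" message. The same shortfall appears for any DC with fewer token-owning nodes than the RF.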
Anyway I think the investigation should be started from the drivers angle, and if it turns out that there's a difference between vnodes and tablets logic, a decision should be made if we're to handle it differently in drivers, or perhaps rely on https://github.com/scylladb/scylladb/issues/20356 for a complete fix.
Assigning to the drivers team for now.
@dkropachev could you confirm that you agree with Karol's assignment to your team? I prefer to validate as this is P1 related to elastic cloud without much updates recently.
s/Karol/Kamil/
Agree, it is P1
Shame on me :-) Sorry.
@aleksbykov, @fruch, we are trying to identify the scope of this. Could you please point me to what makes the replication factor auto-resize in this test?
Packages
Scylla version: `6.2.0~dev-20240916.870d1c16f70c` with build-id `ba54fa6888566c0694ad3f85d3076f346281c16c`
Kernel Version: `6.8.0-1016-aws`
Issue description
I configured a cluster with three DCs: DC1 (eu-westscylla_node_west) with 3 token-owning nodes; DC2 (eu-west-2scylla_node_west) with 4 nodes (3 token-owning and 1 zero-token); DC3 (eu-northscylla_node_north) with 1 zero-token node.
The cluster was configured with tablets enabled. The following cassandra-stress (c-s) command was started as the workload:
cassandra-stress write cl=LOCAL_QUORUM n=20971520 -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -port jmx=6868 -mode cql3 native -rate threads=80 -pop seq=1..20971520 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5
But it terminated with an error:
Nodetool status shows only token-owning nodes (known issue).
The c-s log contains the following errors:
It seems that replication factor auto-resize in a multi-DC cluster with zero-token nodes does not work with tablets.
When run in https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/longevity-multi-dc-rack-aware-zero-token-dc/11/ with exactly the same config and tablets disabled, c-s reported the same warning but continued working and started sending traffic:
The keyspace was created with the following RF and DCs:
Installation details
Cluster size: 6 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: `ami-02a7f524e91e35664 ami-0094c7e2362289d20 ami-0e00401f614e6eb3a` (aws: undefined_region)
Test: longevity-multi-dc-rack-aware-zero-token-dc
Test id: 99408bc4-ff3a-48e0-9b37-65c3cf001435
Test name: scylla-staging/abykov/longevity-multi-dc-rack-aware-zero-token-dc
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 99408bc4-ff3a-48e0-9b37-65c3cf001435`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=99408bc4-ff3a-48e0-9b37-65c3cf001435)
- Show all stored logs command: `$ hydra investigate show-logs 99408bc4-ff3a-48e0-9b37-65c3cf001435`

## Logs:

- **db-cluster-99408bc4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/99408bc4-ff3a-48e0-9b37-65c3cf001435/20240917_161358/db-cluster-99408bc4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/99408bc4-ff3a-48e0-9b37-65c3cf001435/20240917_161358/db-cluster-99408bc4.tar.gz)
- **sct-runner-events-99408bc4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/99408bc4-ff3a-48e0-9b37-65c3cf001435/20240917_161358/sct-runner-events-99408bc4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/99408bc4-ff3a-48e0-9b37-65c3cf001435/20240917_161358/sct-runner-events-99408bc4.tar.gz)
- **sct-99408bc4.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/99408bc4-ff3a-48e0-9b37-65c3cf001435/20240917_161358/sct-99408bc4.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/99408bc4-ff3a-48e0-9b37-65c3cf001435/20240917_161358/sct-99408bc4.log.tar.gz)
- **loader-set-99408bc4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/99408bc4-ff3a-48e0-9b37-65c3cf001435/20240917_161358/loader-set-99408bc4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/99408bc4-ff3a-48e0-9b37-65c3cf001435/20240917_161358/loader-set-99408bc4.tar.gz)
- **monitor-set-99408bc4.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/99408bc4-ff3a-48e0-9b37-65c3cf001435/20240917_161358/monitor-set-99408bc4.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/99408bc4-ff3a-48e0-9b37-65c3cf001435/20240917_161358/monitor-set-99408bc4.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/longevity-multi-dc-rack-aware-zero-token-dc/10/)
[Argus](https://argus.scylladb.com/test/bbd702fb-2f87-4b0b-a068-c2c83d74cb77/runs?additionalRuns[]=99408bc4-ff3a-48e0-9b37-65c3cf001435)