scylladb / gocql

Package gocql implements a fast and robust ScyllaDB client for the Go programming language.
https://docs.scylladb.com/stable/using-scylla/drivers/cql-drivers/scylla-go-driver.html
BSD 3-Clause "New" or "Revised" License

Ensure gocql handles zero-token nodes properly #226

Open dkropachev opened 3 months ago

dkropachev commented 3 months ago

PR#19684 makes it possible to have coordinator-only nodes (also called zero-token nodes). These nodes are going to be supported only in RAFT.

Such nodes, despite being registered in the cluster, do not handle any queries and should be excluded from query routing. This feature is already present in Cassandra but has not been merged into Scylla yet, so we might want to start testing it on our drivers with Cassandra first.

Difference between cassandra and scylla implementation

The major difference is that these nodes are absent from system.peers and system.peers_v2 in Cassandra, while in the Scylla implementation they are going to be present there.

Because of this, we will need to test the Apache and DataStax drivers against Scylla as well.

Approx. Testing plan

Regular cluster

  1. Spin up a cluster with 3 nodes
  2. Join one additional node in zero-token mode, by setting join_ring to false in its configuration, or by adding -Dcassandra.join_ring=false to the command line (Cassandra only).
  3. Make sure that the driver works as expected and does not throw any errors while reading the schema with this node in the cluster.
  4. Make sure that the driver works as expected and does not throw any errors while processing topology events (if such events are issued) when such a node joins or leaves the cluster.
  5. Make sure that the zero-token node does not participate in routing.
  6. Test whether the driver works properly if the only contact point provided is the zero-token node.
  7. Ensure that at no point the driver throws an error or warning caused by the presence of the zero-token node.
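
Step 5 above can be illustrated with a minimal sketch. It is not gocql's internal code: the Host type and the RoutableHosts helper are hypothetical names made up for this example, and it assumes a zero-token node shows up to the driver as a host with an empty token list.

```go
package main

import "fmt"

// Host is a hypothetical, simplified view of a cluster node as a driver
// might see it after reading system.local and system.peers.
type Host struct {
	Addr   string
	DC     string
	Tokens []string // assumed to be empty for a zero-token node
}

// IsZeroToken reports whether the node owns no tokens.
func (h Host) IsZeroToken() bool { return len(h.Tokens) == 0 }

// RoutableHosts returns only the hosts that own tokens, preserving order.
// Zero-token hosts stay known to the caller but are never query targets.
func RoutableHosts(hosts []Host) []Host {
	out := make([]Host, 0, len(hosts))
	for _, h := range hosts {
		if !h.IsZeroToken() {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	hosts := []Host{
		{Addr: "10.0.0.1", DC: "datacenter1", Tokens: []string{"-100"}},
		{Addr: "10.0.0.2", DC: "datacenter1", Tokens: []string{"0"}},
		{Addr: "10.0.0.3", DC: "datacenter1", Tokens: []string{"100"}},
		{Addr: "10.0.0.4", DC: "datacenter1"}, // joined with join_ring=false
	}
	for _, h := range RoutableHosts(hosts) {
		fmt.Println(h.Addr)
	}
}
```

A test for step 5 could then assert that no query is ever attributed to an address outside the routable set.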

Cluster that starts with zero-token node (DROPPED)

  1. Start a single-node cluster with join_ring=false.
  2. Connect to it, to make sure that the driver session is created and every query ends up in a no host available error.
  3. Populate the cluster with 3 more nodes.
  4. Make sure that the driver can execute queries.
  5. Ensure that at no point the driver throws an error or warning.

Zero-token Datacenter

Repeat this scenario for the following policies:

  1. DCAwareRoundRobinPolicy
  2. TokenAwareHostPolicy(DCAwareRoundRobinPolicy())
  3. TokenAwareHostPolicy(RoundRobinHostPolicy())

For DCAwareRoundRobinPolicy use three variants:

  1. Target the first DC, which has real nodes
  2. Target the second DC, which has only zero-token nodes
  3. (For drivers that support it; gocql does not) Do not target any DC, and make sure that the policy won't pick a datacenter with no real nodes.

Steps:

  1. Start a cluster of 2 nodes in 1 DC.
  2. Provision 2 more nodes into a 2nd DC in join_ring=false mode.
  3. Connect to the cluster using the policy, to make sure that the driver session is created and every query is scheduled to regular nodes and executed successfully. In cases when the zero-token DC is targeted, queries are supposed to fail with a no host available error.
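
As a rough sketch of step 3, the three policies could be exercised with a loop like the one below. The contact point 127.0.0.1, the DC name datacenter1 and the probe query are placeholders chosen for this example, not values from the test setup; the code needs the gocql module and a running cluster, so treat it as an outline rather than a finished test.

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	const contactPoint = "127.0.0.1" // placeholder address
	const targetDC = "datacenter1"   // placeholder: the DC with real nodes

	// Each case builds a fresh policy, since a policy instance
	// should not be shared between sessions.
	cases := []struct {
		name   string
		policy func() gocql.HostSelectionPolicy
	}{
		{"DCAwareRoundRobinPolicy", func() gocql.HostSelectionPolicy {
			return gocql.DCAwareRoundRobinPolicy(targetDC)
		}},
		{"TokenAwareHostPolicy(DCAwareRoundRobinPolicy)", func() gocql.HostSelectionPolicy {
			return gocql.TokenAwareHostPolicy(gocql.DCAwareRoundRobinPolicy(targetDC))
		}},
		{"TokenAwareHostPolicy(RoundRobinHostPolicy)", func() gocql.HostSelectionPolicy {
			return gocql.TokenAwareHostPolicy(gocql.RoundRobinHostPolicy())
		}},
	}

	for _, c := range cases {
		cluster := gocql.NewCluster(contactPoint)
		cluster.PoolConfig.HostSelectionPolicy = c.policy()

		session, err := cluster.CreateSession()
		if err != nil {
			log.Printf("%s: session not created: %v", c.name, err)
			continue
		}
		// A trivial probe query; a real test would also record
		// which host served each query.
		err = session.Query("SELECT key FROM system.local").Exec()
		log.Printf("%s: query error = %v", c.name, err)
		session.Close()
	}
}
```

Switching targetDC to the zero-token DC name covers the variant where queries (or session creation) are expected to fail.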

Links

Original umbrella issue in scylladb/scylladb repo: https://github.com/scylladb/scylladb/issues/19693

Core issue to bring join_ring option into scylla: https://github.com/scylladb/scylladb/issues/6527

PR that brings this feature in: https://github.com/scylladb/scylladb/pull/19684

sylwiaszunejko commented 2 weeks ago

@dkropachev I created a PR for the things I discovered when testing the first scenario, but the second scenario cannot be run: when I try to start a single-node cluster with join_ring=false, I get an error: ERROR 2024-10-30 11:37:46,746 [shard 0:main] init - Startup failed: std::runtime_error (Cannot start the first node in the cluster as zero-token)

dkropachev commented 1 week ago

Thanks, it looks like it is impossible, so let's focus on the zero-token DC case.

sylwiaszunejko commented 1 week ago

In cases when zero-token DC is targeted queries suppose to fail with no host available error

@dkropachev Is it OK if it fails with an error like the following when using DCAwareRoundRobinPolicy(zero_token_database)? 2024/11/05 12:44:16 Unable to connect to cluster: gocql: unable to create session: gocql: datacenter datacenter2 in the policy was not found in the topology - probable DC aware policy misconfiguration. Apart from that, I didn't find any incorrect behavior related to zero-token nodes.

sylwiaszunejko commented 6 days ago

@dkropachev ping

dkropachev commented 5 days ago

@sylwiaszunejko, it needs some context, but I am looking for the following scenarios, where:

datacenter2 is a zero-token datacenter
target host - the host you feed to NewCluster
target dc - the DC name you feed to DCAwareRoundRobinPolicy

  1. target host = any host from datacenter1, target dc = datacenter1. It should succeed, and you should be able to execute queries.
  2. target host = any host from datacenter2, target dc = datacenter1. It should succeed, and you should be able to execute queries.
  3. target host = any host from datacenter1, target dc = datacenter2. It should fail with the same error you have provided.
  4. target host = any host from datacenter2, target dc = datacenter2. It should fail with the same error you have provided.
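
The four scenarios above could be written down as a small table for a test to iterate over. This is only a sketch of the expectations: Scenario and ShouldSucceed are hypothetical names, and the rule encoded here (the outcome depends on the targeted DC, not on the contact point) is read off the list rather than taken from gocql.

```go
package main

import "fmt"

// Scenario mirrors one row of the list above.
type Scenario struct {
	TargetHostDC string // DC of the host passed to NewCluster
	TargetDC     string // DC name passed to DCAwareRoundRobinPolicy
}

// ShouldSucceed encodes the expectation from the list: a scenario fails
// only when the policy targets the zero-token datacenter, regardless of
// which datacenter the contact point lives in.
func ShouldSucceed(s Scenario, zeroTokenDC string) bool {
	return s.TargetDC != zeroTokenDC
}

func main() {
	const zeroTokenDC = "datacenter2"
	scenarios := []Scenario{
		{TargetHostDC: "datacenter1", TargetDC: "datacenter1"},
		{TargetHostDC: "datacenter2", TargetDC: "datacenter1"},
		{TargetHostDC: "datacenter1", TargetDC: "datacenter2"},
		{TargetHostDC: "datacenter2", TargetDC: "datacenter2"},
	}
	for i, s := range scenarios {
		fmt.Printf("%d: host in %s, policy targets %s, expect success: %v\n",
			i+1, s.TargetHostDC, s.TargetDC, ShouldSucceed(s, zeroTokenDC))
	}
}
```

A real test would replace the print with a session attempt per row and compare the result against ShouldSucceed.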

roydahan commented 2 days ago

Let's make sure we add a unit test for it.