scylladb / scylla-rust-driver

Async CQL driver for Rust, optimized for ScyllaDB!
Apache License 2.0

Contact points randomization #1029

Open wprzytula opened 2 months ago

wprzytula commented 2 months ago

Problem

When the driver is given an ordered list of initial contact points, it attempts to open the control connection to the nodes in that order. If the cluster is operating normally, the first node always accepts the control connection and becomes burdened with it (it has to send events when they are triggered, and topology & schema metadata when queried) until either the connected driver disconnects or the node breaks down. This imbalance should be avoided.

Solution

cpp-driver by default enables random shuffling of the initial contact points list. This spreads control connections across the nodes instead of concentrating them on the first one. We should do the same and, similarly to cpp-driver, expose a config option to disable this behaviour (mainly useful for deterministic testing).
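A minimal sketch of what this could look like, assuming a hypothetical `ContactPointOrder` option and `order_contact_points` helper (neither is an existing scylla-rust-driver API). It uses a tiny hand-rolled xorshift generator only to stay dependency-free; a real implementation would more likely use the `rand` crate. The optional seed is an assumption added for reproducible tests.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Tiny xorshift64* generator, used only to keep this sketch dependency-free.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // A zero state would make xorshift emit zeros forever.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Value in `0..bound`; the slight modulo bias is acceptable for a sketch.
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Hypothetical config option: shuffle by default, allow opting out.
#[derive(Clone, Copy)]
enum ContactPointOrder {
    /// Keep the user-provided order (useful for deterministic testing).
    AsGiven,
    /// Shuffle; a fixed seed makes the shuffle reproducible.
    Shuffled { seed: Option<u64> },
}

/// Seed derived from the wall clock when the caller does not provide one.
fn time_seed() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos() as u64)
        .unwrap_or(1)
}

/// Returns the contact points in the order in which the control
/// connection should be attempted.
fn order_contact_points(mut points: Vec<String>, order: ContactPointOrder) -> Vec<String> {
    let seed = match order {
        ContactPointOrder::AsGiven => return points,
        ContactPointOrder::Shuffled { seed } => seed.unwrap_or_else(time_seed),
    };
    let mut rng = XorShift64::new(seed);
    // Fisher-Yates shuffle: swap each position with a random earlier-or-equal one.
    for i in (1..points.len()).rev() {
        let j = rng.below(i + 1);
        points.swap(i, j);
    }
    points
}
```

With `ContactPointOrder::AsGiven` the list is returned untouched, which corresponds to the opt-out described above; with `Shuffled { seed: None }` each driver instance would start from a different node with high probability.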

mykaul commented 2 months ago

A similar but separate issue: all shards other than shard 0 are also contacted in the same order. This causes a small connection storm when a node comes back up. I understand we have to contact shard 0 first, but the rest should be contacted in a random order.

Lorak-mmk commented 2 months ago

Why do we have to contact shard 0 first?

wprzytula commented 2 months ago

> Why do we have to contact shard 0 first?

Because we only know that nodes contain at least shard 0. No other shard's presence is guaranteed.

piodul commented 2 months ago

> Because we only know that nodes contain at least shard 0. No other shard's presence is guaranteed.

The driver does not choose a shard when establishing the first connection. Usually, the first connection should be established to the non-shard-aware port and Scylla will choose the shard that is least loaded with connections.

After the first connection is made and the driver learns how many shards the node has, it will start connecting to all other shards at once (that's how it works right now).

mykaul commented 2 months ago

> Because we only know that nodes contain at least shard 0. No other shard's presence is guaranteed.

Indeed.

> The driver does not choose a shard when establishing the first connection. Usually, the first connection should be established to the non-shard-aware port and Scylla will choose the shard that is least loaded with connections.

True.

> After the first connection is made and the driver learns how many shards the node has, it will start connecting to all other shards at once (that's how it works right now).

Yes. And it may cause (as we've seen in the past) a connection storm, because all drivers that see the new node come up will do the same. Ideally, the driver should randomize and pace the connections to all other shards. (Of course, there are other optimizations I'd love to see in the connection phase, which would lessen the impact.)
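To illustrate "randomize and pace", here is a rough sketch under stated assumptions: `plan_shard_connections`, its parameters, and the slot-plus-jitter pacing scheme are all hypothetical and are not taken from the driver. It takes the shard that the first connection landed on (chosen by Scylla, as described above), shuffles the remaining shards, and assigns each one a start delay inside its own time slot.

```rust
use std::time::Duration;

/// Hypothetical helper: returns `(shard, delay)` pairs describing when to open
/// a connection to each remaining shard. Not an existing driver API.
fn plan_shard_connections(
    nr_shards: u16,
    connected_shard: u16,
    base_gap: Duration,
    seed: u64,
) -> Vec<(u16, Duration)> {
    // Every shard except the one we already have a connection to.
    let mut shards: Vec<u16> = (0..nr_shards).filter(|s| *s != connected_shard).collect();

    // Plain xorshift64, good enough for a sketch; `| 1` keeps the state nonzero.
    let mut state = seed | 1;
    let mut next = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };

    // Randomize: Fisher-Yates shuffle, so different drivers pick different orders.
    for i in (1..shards.len()).rev() {
        let j = (next() % (i as u64 + 1)) as usize;
        shards.swap(i, j);
    }

    // Pace: the k-th connection starts somewhere inside the k-th slot of
    // length `base_gap`, i.e. at k * base_gap plus a jitter in [0, base_gap).
    let gap_ms = base_gap.as_millis() as u64;
    shards
        .into_iter()
        .enumerate()
        .map(|(k, shard)| {
            let jitter = if gap_ms == 0 { 0 } else { next() % gap_ms };
            (shard, Duration::from_millis(k as u64 * gap_ms + jitter))
        })
        .collect()
}
```

A caller would sleep until each delay elapses before opening the corresponding connection. Seeding each driver instance differently is what keeps many drivers from hitting the same shard at the same moment when a node comes back up.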