scylladb / gocql

Package gocql implements a fast and robust ScyllaDB client for the Go programming language.
https://docs.scylladb.com/stable/using-scylla/drivers/cql-drivers/scylla-go-driver.html
BSD 3-Clause "New" or "Revised" License

Add LOAD_BALANCING_POLICY_SLOW_AVOIDANCE functionality #168

Closed sylwiaszunejko closed 1 month ago

sylwiaszunejko commented 3 months ago

The Java driver has a feature that automatically avoids slow replicas using simple heuristics (https://github.com/scylladb/java-driver/blob/scylla-4.x/core/src/main/java/com/datastax/oss/driver/internal/core/loadbalancing/DefaultLoadBalancingPolicy.java#L104). This is one of the key features for keeping latency stable.

This PR adds an additional field in tokenAwareHostPolicy to control whether the feature is enabled and what the maximum in-flight threshold is.

If the feature is enabled, the driver sorts the replicas so that those with fewer than the specified maximum number of in-flight requests are tried first.

Fixes: #154
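For illustration, here is a minimal sketch of how such an opt-in could be wired into a cluster configuration. The option name AvoidSlowReplicas and its threshold argument are assumptions made for this example only, not necessarily the exact API introduced by this PR:

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder contact points for a 3-node cluster.
	cluster := gocql.NewCluster("10.0.0.1", "10.0.0.2", "10.0.0.3")
	cluster.Keyspace = "keyspace1"

	// Token-aware routing with a round-robin fallback. AvoidSlowReplicas and
	// its threshold are hypothetical names used only to illustrate the opt-in.
	cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(
		gocql.RoundRobinHostPolicy(),
		gocql.AvoidSlowReplicas(10), // hypothetical: max in-flight requests before a replica is deprioritized
	)

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```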

sylwiaszunejko commented 3 months ago

I am not sure how to test this; I am open to any suggestions.

roydahan commented 3 months ago

Can you please provide a reference of this functionality in the Java driver?

sylwiaszunejko commented 3 months ago

Can you please provide a reference of this functionality in the Java driver?

I added a link to the code to the description.

avikivity commented 2 months ago

I'm suspicious of this policy, it can induce instability in the cluster.

  1. A node is slower than the others
  2. The policy moves work to other nodes
  3. Another node becomes slow due to work moved
  4. Slowness moves around the cluster as the policy keeps moving work around

Any action taken has to be strongly dampened to avoid overreacting. This is especially important when there are many client processes making independent, but similar decisions.
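One common way to apply such dampening is an exponentially weighted moving average with hysteresis, so a single spike in in-flight requests does not immediately flip a host to "slow". A minimal sketch, assuming the policy samples per-host in-flight counts (all names here are hypothetical, not part of this PR):

```go
package main

import "fmt"

// slowDetector dampens the "slow host" decision with an exponentially
// weighted moving average plus hysteresis, so short spikes in in-flight
// requests do not immediately (or permanently) flip the verdict.
type slowDetector struct {
	ewma      float64 // smoothed in-flight count
	alpha     float64 // smoothing factor; smaller means stronger dampening
	highWater float64 // mark the host slow only above this level
	lowWater  float64 // mark it healthy again only below this level
	slow      bool
}

// observe feeds one in-flight sample and returns the dampened verdict.
func (d *slowDetector) observe(inFlight int) bool {
	d.ewma = d.alpha*float64(inFlight) + (1-d.alpha)*d.ewma
	switch {
	case !d.slow && d.ewma > d.highWater:
		d.slow = true
	case d.slow && d.ewma < d.lowWater:
		d.slow = false
	}
	return d.slow
}

func main() {
	d := &slowDetector{alpha: 0.05, highWater: 10, lowWater: 5}
	for _, sample := range []int{2, 3, 40, 50, 45, 4, 3, 2} {
		slow := d.observe(sample)
		fmt.Printf("sample=%d ewma=%.1f slow=%v\n", sample, d.ewma, slow)
	}
}
```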

sylwiaszunejko commented 2 months ago

v2:

The process of testing this change is still in progress. I am trying to observe the behavior in use by setting up a 3-node cluster with one "slow" node (with some sleeps added), aiming to show that replicas on this host are considered unhealthy. However, so far, I have not been able to observe a higher number of in-use streams.

mykaul commented 2 months ago

If feature is enabled driver sorts the replicas to first try those with less than specified maximum in flight requests.

There was a nice addition of 'power of two choices' in one of our drivers (Java?) - I think this might be a good candidate here as well.
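For reference, a minimal sketch of the 'power of two choices' idea applied to replica selection; the host type and its in-flight counter below are stand-ins for whatever the driver tracks internally:

```go
package main

import (
	"fmt"
	"math/rand"
)

// host is a stand-in for a driver host handle; inFlight represents a
// per-host in-flight request counter (an assumption for this sketch).
type host struct {
	addr     string
	inFlight int
}

// pickReplica applies the "power of two choices" heuristic: sample two
// replicas at random and route to the one with fewer in-flight requests.
func pickReplica(replicas []*host) *host {
	if len(replicas) == 1 {
		return replicas[0]
	}
	a := replicas[rand.Intn(len(replicas))]
	b := replicas[rand.Intn(len(replicas))]
	if b.inFlight < a.inFlight {
		return b
	}
	return a
}

func main() {
	replicas := []*host{
		{addr: "10.0.0.1", inFlight: 3},
		{addr: "10.0.0.2", inFlight: 40}, // the slow node
		{addr: "10.0.0.3", inFlight: 5},
	}
	fmt.Println("chosen:", pickReplica(replicas).addr)
}
```

The appeal of this heuristic is that it avoids a hard threshold entirely: the slowest host is never singled out globally, it just loses most pairwise comparisons, which tends to spread load without the all-or-nothing behavior of a fixed cutoff.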

sylwiaszunejko commented 2 months ago

I'm suspicious of this policy, it can induce instability in the cluster.

1. A node is slower than the others

2. The policy moves work to other nodes

3. Another node becomes slow due to work moved

4. Slowness moves around the cluster as the policy keeps moving work around

Any action taken has to be strongly dampened to avoid overreacting. This is especially important when there are many client processes making independent, but similar decisions.

@avikivity I guess it is working fine for java-driver, but at the same time this behavior is not tested there so I am curious how often it is actually used and how often the node is slow. Especially because the MAX_IN_FLIGHT_THRESHOLD is not configurable in java-driver.

avikivity commented 2 months ago

I'm suspicious of this policy, it can induce instability in the cluster.

1. A node is slower than the others

2. The policy moves work to other nodes

3. Another node becomes slow due to work moved

4. Slowness moves around the cluster as the policy keeps moving work around

Any action taken has to be strongly dampened to avoid overreacting. This is especially important when there are many client processes making independent, but similar decisions.

@avikivity I guess it is working fine for java-driver, but at the same time this behavior is not tested there so I am curious how often it is actually used and how often the node is slow. Especially because the MAX_IN_FLIGHT_THRESHOLD is not configurable in java-driver.

I don't have concrete evidence that it fails, just suspicions.

The policy originated[1] with Cassandra where there is an actual source of node slowness - full garbage collection. We don't have this source, but we have others.

I heard that the "dynamic snitch" (which performs similar functionality for the coordinator->replica hop) performs poorly. See https://thelastpickle.com/blog/2019/01/30/new-cluster-recommendations.html (see 5).

MAX_IN_FLIGHT_THRESHOLD changes its meaning depending on whether we use shard-aware or shard-unaware mode, and is very workload dependent. In a cache-hit-intensive workload you'd see a low in-flight request count; in a cache-miss-intensive workload you'd see a high in-flight request count.
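As a rough illustration of that workload dependence (numbers invented for the example): by Little's law the steady-state in-flight count is roughly throughput times latency, so a host serving 10,000 requests/s at 1 ms mean latency (cache hits) sits around 10 in-flight requests, while the same throughput at 20 ms (cache misses, or a genuinely slow node) sits around 200. No single fixed threshold fits both shapes of workload.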

A node could have high latency because one of the replicas it is accessing is slow (mostly for small clusters where a single replica is 1/3 of the replicas contacted by the coordinator; for large clusters the slow replica would be averaged out).

We could add it for parity with the Java driver, but we have to be careful about recommending it.

[1] I'm just guessing

sylwiaszunejko commented 2 months ago

I managed to observe the behavior in use by setting up a 3-node cluster with one "slow" node (with some sleeps). Extra logs showed a larger number of in-use streams, and the slow node was considered unhealthy.

Now I will repeat the process to see the behavior on metrics.

sylwiaszunejko commented 2 months ago

I used a 3-node cluster with one slow node (with some sleeps).

It is not obvious how to observe slow replica avoidance in the metrics. If replica shuffling is disabled (the default behavior) or the query happens to be an LWT, there is little to no difference between gocql with and without slow replica avoidance. The difference can only be seen if the slow node happens not to be first on the replica list.

This graph shows the number of requests for different configurations: avoidSlowReplicas=false, MAX_IN_FLIGHT_THRESHOLD=0, MAX_IN_FLIGHT_THRESHOLD=10, MAX_IN_FLIGHT_THRESHOLD=5, and MAX_IN_FLIGHT_THRESHOLD=7 (the blue line is the slow node; the yellow one is first on the replica list most of the time and is as fast as the green one). The test ran 20 concurrent scenarios with 10 INSERT queries at RF=3 and 500 SELECT queries at CL=1.

Screenshot from 2024-05-08 12-49-16

If the driver does shuffle the replica list, enabling slow replica avoidance gives a positive outcome: the test execution is faster and more queries are directed to the fast nodes. (In this test there were 5 concurrent scenarios with 10 INSERT queries at RF=3 and 50 SELECT queries at CL=1.)

Screenshot from 2024-05-08 19-15-12

With a higher number of concurrent scenarios, the version without slow replica avoidance times out, while the version with it works just fine:

Screenshot from 2024-05-08 20-44-42

sylwiaszunejko commented 2 months ago

While testing the slow replica avoidance functionality, I realized that simple queries, e.g. SELECT pk, ck, v FROM keyspace1.table1 WHERE pk = x;, were incorrectly marked as LWT queries, and because of that there was little to no difference between gocql with and without slow replica avoidance. I submitted an issue (#174) and a PR to fix it (#173).

sylwiaszunejko commented 2 months ago

@avikivity I added some metrics to show the impact on the performance, what do you think about merging this?

avikivity commented 2 months ago

@avikivity I added some metrics to show the impact on the performance, what do you think about merging this?

https://github.com/scylladb/gocql/pull/168#issuecomment-2074588772 is not an objection to merging, since it's adding functionality already in the Java driver (opt-in I hope). I'll go over your measurements regardless.

avikivity commented 2 months ago

btw - @michoecho this can feed into our discussion re slow shards.

avikivity commented 2 months ago

Your measurement results are unsurprising - for sure if one node really is slow, then avoiding it gives better results. My worry is that we'll misdetect a node as slow, thereby diverting requests to another node, which then becomes slow, starting a feedback loop where we cause nodes to be slow.

If the detection threshold is sufficiently high, perhaps this doesn't happen.

mrsinham commented 2 months ago

Hello there, just commenting to give a little context. It happened that we had a very slow node, especially during the upgrade process or with hardware issues. With earlier versions of the driver this slowed everything down, causing timeouts, read/write issues, and angry customers. This PR should definitely improve those situations.