scylladb / java-driver

ScyllaDB Java Driver for ScyllaDB and Apache Cassandra, based on the DataStax Java Driver
Apache License 2.0

Add fallback from shard awareness in case of overloaded coordinator shards #136

Open psarna opened 2 years ago

psarna commented 2 years ago

This issue is about implementing a mechanism for falling back to picking the least busy connection, even when we know the owner shard, in case we detect that a particular coordinator shard is severely overloaded. The idea is already implemented and explained here: https://github.com/scylladb/gocql/pull/86
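To make the mechanism concrete, here is a minimal sketch under assumed names (ShardAwareFallback, Connection, and the 0.5 ratio are all illustrative, not the driver's actual types or tuning): prefer the owner shard's connection unless its in-flight count crosses a threshold, then fall back to the least busy connection on the node.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; names and the threshold are assumptions,
// not the Java driver's real pool API.
class ShardAwareFallback {

  interface Connection {
    int inFlight(); // current number of in-flight requests on this connection
  }

  // Fraction of max requests per connection above which the owner shard
  // is considered overloaded (the gocql PR uses a similar threshold idea).
  static final double OVERLOAD_RATIO = 0.5;

  static Connection pickConnection(List<Connection> pool, // one connection per shard, by index
                                   int ownerShard,
                                   int maxRequestsPerConnection) {
    Connection owner = pool.get(ownerShard);
    if (owner.inFlight() < OVERLOAD_RATIO * maxRequestsPerConnection) {
      return owner; // normal shard-aware path
    }
    // Owner shard looks overloaded: fall back to the least busy connection,
    // accepting a cross-shard hop on the server side.
    return pool.stream()
        .min(Comparator.comparingInt(Connection::inFlight))
        .orElse(owner);
  }
}
```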

The whole discussion is worth reading, but specifically, in comment https://github.com/scylladb/gocql/pull/86#issuecomment-1146021940 @vladzcloudius requests implementing a similar behavior in the Java driver, so that he can conveniently test it with cassandra-stress. Hence, this issue :)

/cc @avelanarius does your team have spare cycles to implement such a change, potentially based on the existing PR for gocql linked above?

vladzcloudius commented 2 years ago

@DoronArazii Please, assign somebody to work on this.

DoronArazii commented 2 years ago

Thanks for raising this, @vladzcloudius.

@mykaul please consult with @avelanarius regarding this enhancement.

mykaul commented 1 year ago

Sure. I'll let @avelanarius assess feasibility and priority.

tarzanek commented 1 year ago

Based on how many people use the shard-aware Java driver (and how many I've told not to use it in case of shard overload) ... guys, this is a P1 IMHO. @mykaul @avelanarius it's low-hanging fruit that will save us from lots of issues - let's get it fixed.

avelanarius commented 1 year ago

What would be the test scenario? A single overloaded shard (for example, due to compaction)? Or would someone else do the actual benchmarking after we implement it?

tarzanek commented 1 year ago

@avelanarius yes, the compaction test would be good. The ideal test would be more clients fighting over one shard ... this way their requests would be routed away once they hit its limit.

tarzanek commented 1 year ago

ref https://github.com/scylladb/scylladb/issues/8996

vladzcloudius commented 1 year ago

Simply overload a single shard with some CPU-hogging work (some tight-loop C application) and run a cassandra-stress read test with RF=3.
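The hog can be any spin loop; for instance (an illustrative Java stand-in for the C application mentioned above), pinned to the victim shard's CPU with taskset:

```java
// BusyLoop.java - keeps one core 100% busy. Pin it to the CPU that is
// mapped to the target shard, e.g.: taskset -c 3 java BusyLoop
public class BusyLoop {
  public static void main(String[] args) {
    long spins = 0;
    while (true) {
      spins++; // tight loop, never yields
    }
  }
}
```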

Without the patch, a corresponding share of requests will have higher latency due to queuing on that shard - foreground reads will go as high as you allow: if you set threads=N, then on that shard the queue will grow up to N.

With the fix, the driver is expected to start steering reads to other shards after some threshold (AFAIR half of max_requests_per_connection, which is controlled by the maxPending argument of cassandra-stress).

avelanarius commented 1 year ago

An update on the issue: @Gor027 will start with reproducing the issue (in a clean way!) - a scenario very similar to what @vladzcloudius described: starting an X-shard node, overloading one shard, and sending queries that touch data in the overloaded shard. Next, test how well it performs if we always send the queries to the correct (overloaded) shard versus always sending them to a wrong (non-overloaded) shard. I think it'll be useful to do a "semi-deep" dive analyzing what actually happens in Scylla.

Gor027 commented 1 year ago

The attempt to reproduce the issue was as @avelanarius described above:

* Start a 4-shard single node with some data
* Make sure that a single CPU is mapped to a single shard
* Overload one shard with a simple busy-wait C program
* Test the read throughput when queries (prepared statements) are always sent to the correct shard, which is overloaded
* Test the performance when they are sent to a wrong, non-overloaded shard

The results show that sending to a wrong, relatively non-overloaded shard doubles the throughput.

[image: benchmark throughput results]

A possible explanation, as the tracing of the query execution shows, is that when the query is sent to the wrong shard, that shard parses and processes the statement itself instead of immediately redirecting the query to the correct shard that holds the data. So some part of the query execution is done by a non-overloaded shard, which is not the case when shard awareness is enabled. Thus, under extremely high load, picking a less overloaded shard performs better. The question is how to implement such behavior. A quick solution might be to introduce a new load-balancing policy that uses cluster metrics: when sending a query, it would pick the node with the fewest in-flight requests, using that as an indicator that it is less overloaded. I think it can be quickly implemented for Java Driver 3.x, to be tested and confirmed in the field that it works better than the current version.
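As a sketch of that quick solution (illustrative only: the real 3.x LoadBalancingPolicy also involves init/distance and topology-event callbacks, and the in-flight counters live in the session's connection pools, so inFlight here is an assumed accessor):

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.function.ToIntFunction;
import java.util.stream.Collectors;

// Order candidate hosts by their current in-flight count when building
// a query plan; the least busy host is tried first.
class LeastBusyPlan {
  static <H> Iterator<H> newQueryPlan(List<H> liveHosts, ToIntFunction<H> inFlight) {
    return liveHosts.stream()
        .sorted(Comparator.comparingInt(inFlight))
        .collect(Collectors.toList())
        .iterator();
  }
}
```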

CC @vladzcloudius

vladzcloudius commented 1 year ago

Great. Thanks for the update, @Gor027.

The question is how to implement such behavior.

Please see the opening message of this GH issue. It references the already merged solution for this issue in the Scylla GoCQL. There is also a reference to the RFC that eventually led to that implementation.

Note that @avelanarius asked you to run this testing in the context of a request to implement the same in Java (and in every other Scylla driver).

avelanarius commented 1 year ago

Yes, it is already merged, but it was merged without any proper testing (not merged by me) - so it would still require some tuning. As this tuning could be tricky and time-consuming, we decided to first try @fee-mendes's safer idea: in the driver's load-balancing policy, pick the node+shard with the least number of in-flight requests. However, we are not saying that this is the only solution we will implement and try out.

vladzcloudius commented 1 year ago

I don't understand your point, @avelanarius. How is "Felipe's idea" different from what's described here (https://github.com/scylladb/gocql/pull/86#issuecomment-1146021940) and implemented in the GoCQL?

avelanarius commented 1 year ago

The difference is as follows. Suppose we have a table in a keyspace with RF=3.

An INSERT query is executed. 3 nodes can become coordinators: A (shard a), B (shard b), C (shard c).

The GoCQL optimization (as I understand it) works as follows: if we decide to send a query to A (shard a), but this shard (the connection to it) is overloaded, we will instead send the query to A (shard leastBusyOnA). We wouldn't change the node at that moment. We would send the query to a wrong shard, resulting in cross-shard ops (not saying this is bad).

The optimization that I described would kick in earlier - at the moment of choosing which replica to send the query to. We would sort the replica+shard pairs by the number of in-flight requests. If A + shard a was overloaded, we would instead choose, for example, B (shard b) - whichever had the least number of in-flight requests (keeping a count per shard on each node). Shard b on node B would be the "correct" shard, not resulting in cross-shard ops.
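A minimal sketch of that replica-level selection (Candidate and its fields are made-up names, not driver classes): each candidate pairs a replica with the shard that owns the token on that node, annotated with the in-flight count kept for that (node, shard) pair.

```java
import java.util.Comparator;
import java.util.List;

// Pick the replica whose owner shard currently has the fewest in-flight
// requests; whichever we pick, it is the "correct" shard for the token
// on its node, so no cross-shard op happens server-side.
class ReplicaShardChooser {
  record Candidate(String node, int shard, int inFlightOnShard) {}

  static Candidate choose(List<Candidate> replicas) {
    return replicas.stream()
        .min(Comparator.comparingInt(Candidate::inFlightOnShard))
        .orElseThrow();
  }
}
```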

vladzcloudius commented 1 year ago

I see. I assumed that the optimization you are describing is already the way the coordinator is chosen.

If it's not the case this is an orthogonal issue that needs to be resolved regardless.

The optimization in question (the topic of this GH issue) deals with the situation that may arise after the logic you described has been applied.

Let's not mix these things together - these two heuristics are meant to complement each other. However, there are use cases where your heuristic alone won't help: e.g. in your example, when the corresponding shard on all 3 replicas is overloaded, say due to compactions running on all of them after a heavy write workload put a lot of data into a single big partition.

avelanarius commented 1 year ago

I now understand the confusion: Java Driver 4.x DOES take the number of in-flight requests into account in the load balancing policy. However, the existing implementation only compares whole-node in-flight numbers, not per-shard numbers, so I guess it isn't that sensitive to a single shard being overloaded. Java Driver 3.x (used, for example, in cassandra-stress) has a naive round-robin policy that doesn't take the in-flight request numbers into account.
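A toy illustration (made-up numbers) of why the whole-node total is a weak signal - a single hot shard can hide behind a healthy-looking node total:

```java
import java.util.Arrays;

// Per-shard in-flight counts on two hypothetical 4-shard nodes.
class WholeNodeVsPerShard {
  public static void main(String[] args) {
    int[] nodeA = {90, 2, 3, 1};    // shard 0 overloaded; total = 96
    int[] nodeB = {25, 25, 25, 25}; // evenly loaded;      total = 100
    // Comparing node totals (the 4.x-style signal) prefers A, which is
    // the worse choice when the query's owner shard on A is shard 0.
    System.out.println("totals: A=" + Arrays.stream(nodeA).sum()
        + " B=" + Arrays.stream(nodeB).sum());
    System.out.println("owner shard 0: A=" + nodeA[0] + " B=" + nodeB[0]);
  }
}
```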

Gor027 commented 1 year ago

A new RandomTwoChoice policy is waiting to be merged in PR #198. The policy adds a slight optimization to the way the coordinator node is chosen when replicas are overloaded: it takes into account the current load of each replica and shard before sending a request. The benchmarks and results are described in detail in the PR description; you can see how well it performs when the cluster is partially overloaded. This, however, does not fully solve the problem described in this issue. When the whole cluster (all nodes/replicas) is overloaded due to heavy writes, which cause a lot of background processing like compaction, hints, etc., the throughput remains the same according to our benchmarks, but latencies may skyrocket and timeouts may occur. So perhaps that case merits a separate fix and benchmarks.
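PR #198 has the actual implementation and benchmarks; generically, "random two choices" amounts to sampling two candidates and keeping the less loaded one, as in this sketch (names are illustrative, not the PR's code):

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.ToIntFunction;

// Power-of-two-random-choices selection: comparing just two random
// candidates avoids herding on a single least-loaded target while
// still steering away from overloaded ones.
class TwoChoiceSketch {
  static <H> H pick(List<H> candidates, ToIntFunction<H> inFlight) {
    ThreadLocalRandom rnd = ThreadLocalRandom.current();
    H a = candidates.get(rnd.nextInt(candidates.size()));
    H b = candidates.get(rnd.nextInt(candidates.size()));
    return inFlight.applyAsInt(a) <= inFlight.applyAsInt(b) ? a : b;
  }
}
```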

vladzcloudius commented 1 year ago

Java Driver 3.x (used, for example, in cassandra-stress) has a naive round-robin policy that doesn't take the in-flight request numbers into account.

Are you sure, @avelanarius? I remember that the 3.x Java drivers used a "shortest queue first" algorithm to pick a TCP connection to the same host when there is more than one such connection - which was the case by default, and we usually made sure of that explicitly as well.

Gor027 commented 1 year ago

I think two things are getting mixed up. The load-balancing policies in Java drivers (TokenAware, RoundRobin, etc.) only decide which nodes should handle the request. At present, in Java driver 3.x those decisions are independent of the host/shard in-flight request metrics. However, once a particular node is chosen to handle the request, the session manager picks from that node/shard's connection pool the TCP connection with the lowest number of in-flight requests. In Java driver 4.x, on the other hand, the default load-balancing policy does take the total number of in-flight requests of a node into account when deciding which node should handle the request: the in-flight request metric serves as a health check, and nodes with fewer in-flight requests are preferred. I think @avelanarius wanted to highlight that, with a slight modification, the shard-aware driver will be capable of comparing the in-flight requests of the target shards for a particular request, instead of comparing the sum of all in-flight requests per node. That will allow the load-balancing policy to make better decisions and avoid sending requests to a replica whose target shard is overloaded.
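In sketch form, the two stages described above look roughly like this (hypothetical types; the real policy and pool classes differ):

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

class TwoStageSelection {
  interface Connection {
    int inFlight();
  }

  static <H> Connection select(Iterator<H> queryPlan,
                               Function<H, List<Connection>> poolOf) {
    // Stage 1: the load-balancing policy has ordered the nodes; take the first.
    H node = queryPlan.next();
    // Stage 2: within that node's pool, pick the TCP connection with the
    // fewest in-flight requests ("shortest queue first").
    return poolOf.apply(node).stream()
        .min(Comparator.comparingInt(Connection::inFlight))
        .orElseThrow();
  }
}
```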

vladzcloudius commented 1 year ago

@Gor027 No need to go in circles - we established the two stages of load balancing you are referring to before: https://github.com/scylladb/java-driver/issues/136#issuecomment-1347353867

This GH issue was, from the start, about the load-balancing decision that is made after the target coordinator node has been chosen.

And, yes, if you want to improve the algorithm of choosing the coordinator node - it has to be a matter of a separate GH issue.

Back to the context of this GH issue: the idea is to choose to send a CQL request to a "wrong shard" on the chosen coordinator under certain conditions (see the opening message).