Open junglie85 opened 2 years ago
IIUC (but I'm no expert, confirmation welcome) a Shard ID (= `u32`) is specific to a node.
So to know precisely which shard a statement would be directed to, you would need to know both which node the statement should go to and which shard inside that node.
I must have missed this issue before - our driver already computes everything it needs to be shard-aware, so it's capable of assigning tokens to owner shards of a particular node. The code is available in this source file: https://github.com/scylladb/scylla-rust-driver/blob/fd06928929c0dd78a72c7aca494a998da1514a79/scylla/src/routing.rs , so I guess it's only a matter of exposing these details to the end user. Right now they're hidden quite deep in the load balancing policy implementation.
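For reference, the mapping implemented in that file boils down to roughly the following (a standalone sketch rather than the driver's public API; `nr_shards` and `msb_ignore` stand for the per-node sharding parameters that Scylla advertises to the client):

```rust
// Rough sketch of the token-to-shard mapping used by shard-aware Scylla drivers:
// bias the signed Murmur3 token into the unsigned range, drop the most
// significant bits the node tells us to ignore, then scale onto [0, nr_shards).
fn shard_of(token: i64, nr_shards: u32, msb_ignore: u8) -> u32 {
    let mut biased = (token as u64).wrapping_add(1u64 << 63);
    biased <<= msb_ignore;
    ((biased as u128 * nr_shards as u128) >> 64) as u32
}

fn main() {
    // Hypothetical parameters: 8 shards per node, msb_ignore = 12.
    let shard = shard_of(-4_069_959_284_402_364_209, 8, 12);
    println!("token maps to shard {shard}");
}
```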
It should be feasible with a combination of #484 and #485. Actually, I opened those MRs because I'm also using the shard ID.
It doesn't seem like this gives access to the shard starting from the prepared statement + serialized values combination. I feel like that would be more practical to use, as it would allow abstracting away the keyspace/table/partition-key part (especially partition-key knowledge that application code may otherwise not have) when dealing with sharding-related considerations.
It'd be helpful to have greater control over how records are batched. One scenario my team has discussed is using the Shard ID as a means to group data for batching, irrespective of partition.

The general thinking is that we can get the relevant `Token` from a `PreparedStatement` by computing the partition key and passing it to the `Partitioner`. To achieve this, ideally `get_partitioner_name()` would be `pub` instead of `pub(crate)` on `PreparedStatement`.

With the `Token`, get the Shard ID from somewhere that makes sense - no good ideas on this yet, perhaps a method on the `ClusterData`.

App-specific logic can then be used to group records by the shard ID and batch them for writing to Scylla.
1. Is there any reason why this is a bad idea?
With Tablets (https://www.scylladb.com/presentations/tablets-rethinking-replication/) there is no longer a mapping between Token and shard (and such a mapping was always an implementation detail), because the same partition may be located on different shards of different replicas. I'm not sure what your use case for batches is, but in general you should batch by Token, otherwise you start doing multi-partition batches, which incur more work for the coordinator node and are not a good idea afaik.
2. What's the appetite for API changes that would enable this?
Rather than expose the shard information, a shard aware batching API would be more useful. I think probably related is #448.
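To make that concrete, here is a minimal sketch of what application-side grouping for batching could look like, under the assumption that the driver exposes some way to turn a row's partition key into a `Token` (the `partition_token` helper below is hypothetical and only stands in for that). Grouping by token keeps each batch single-partition, as recommended above; a shard-aware batching API could additionally do the token-to-node/shard routing internally:

```rust
use std::collections::HashMap;

// Illustrative row type for an append-log style workload.
struct Row {
    partition_key: String,
    payload: Vec<u8>,
}

// Hypothetical helper: stands in for a driver-provided way to compute the
// partition token of a row (e.g. from the prepared statement's partitioner
// and the serialized partition key). This is NOT an existing driver API.
fn partition_token(row: &Row) -> i64 {
    row.partition_key
        .bytes()
        .fold(0i64, |acc, b| acc.wrapping_mul(31).wrapping_add(b as i64))
}

// Group rows by token so each group can become a single-partition batch.
fn group_for_batching(rows: Vec<Row>) -> HashMap<i64, Vec<Row>> {
    let mut groups: HashMap<i64, Vec<Row>> = HashMap::new();
    for row in rows {
        groups.entry(partition_token(&row)).or_default().push(row);
    }
    groups
}

fn main() {
    let rows = vec![
        Row { partition_key: "sensor-1".into(), payload: vec![1] },
        Row { partition_key: "sensor-2".into(), payload: vec![2] },
        Row { partition_key: "sensor-1".into(), payload: vec![3] },
    ];
    for (token, group) in group_for_batching(rows) {
        println!("token {token}: batch of {} rows", group.len());
    }
}
```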
otherwise you start doing multi-partition batches, which incur more work for the coordinator node and are not a good idea afaik
The main issue with not batching is that going through the heavy Tokio machinery for every statement puts 5x the CPU load on the server that makes the Scylla calls using the Rust driver, whereas with batching that load is shared across coordinators (and is probably less significant than managing Tokio futures even ignoring the fact that the work is shared, because Tokio turned out to be heavy in my benchmarks).
IIUC, your idea to create multi-partition batches performs faster wrt the driver, but imposes higher load on the cluster. Right? So this is a trade-off that we used to reject in the past due to our goal being to minimise load on the Scylla cluster, thus making it maximally performant.
thus making it maximally performant
It looks like at the moment the cluster may have to handle some extra networking and individual handling for every statement, which might put sub-optimal load on it as well, and that seems hard to control unless either one can send batches or this can be configured to be longer (ideally into something like `chunks_timeout`).
But more importantly, if we agree that the query load is similar to or greater on the client side than on the server side when statements are handled individually, then for the use-case of performing high-frequency append-log style operations, like mine and that of the author of #974, it seems clear that more Scylla servers will be required to store data and perform maintenance operations than client servers (~1/10 ratio). So it doesn't seem practical to close off APIs that would allow offloading part of the work to the servers: that would not significantly reduce the number of required Scylla servers (they would still be needed for maintenance operations to work properly), but it would drastically increase the number of required clients.
Bear in mind that with this use-case, the current alternative is just making batches that are not targeted, and that's no better for the servers' load either.
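For what it's worth, the kind of client-side coalescing alluded to with `chunks_timeout` can already be approximated on top of tokio-stream (assuming its `time` feature); this is just a sketch of chunking an incoming stream of would-be statements by size or timeout, with the actual batch construction and routing left out:

```rust
use std::time::Duration;
use tokio_stream::StreamExt; // `chunks_timeout` requires tokio-stream's "time" feature

#[tokio::main]
async fn main() {
    // Hypothetical high-frequency stream of append-log style writes
    // (plain strings here, standing in for prepared statement + values).
    let incoming = tokio_stream::iter((0..10_000u32).map(|i| format!("append-{i}")));

    // Emit a chunk once 256 items accumulate or 5 ms pass, whichever comes first.
    let chunks = incoming.chunks_timeout(256, Duration::from_millis(5));
    tokio::pin!(chunks);

    while let Some(chunk) = chunks.next().await {
        // In a real application each chunk would be turned into one
        // (ideally token-targeted) batch and sent through the driver.
        println!("would send a batch of {} statements", chunk.len());
    }
}
```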
That's convincing, at least to me.
OK, could you reiterate what kind of API you specifically need access to in order to get such client-side performance gains?
It looks like the discussion here is joining with that of #738 so I'm just going to forward to there: https://github.com/scylladb/scylla-rust-driver/pull/738#issuecomment-2188751000