rukai opened this issue 1 year ago
Hi @rukai, there is an active PR to manually set the topology refresh interval: https://github.com/scylladb/scylla-rust-driver/pull/776
However, the other points you raise are not addressed in that pull request.
Oh, sorry for duplicating your PR; I'll go ahead and close mine.
It would make semantic sense to be able to set `None` in order to completely disable the refresh, but I can always just set the value to something huge like 1,000,000 seconds for effectively the same result.
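For reference, a minimal sketch of what I mean, assuming the `cluster_metadata_refresh_interval` builder option that the linked PR adds to `SessionBuilder` (the exact method name may differ):

```rust
use std::time::Duration;
use scylla::{Session, SessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Effectively disable the background refresh by making the interval huge.
    // Assumes the builder option added by the linked PR; the exact name may differ.
    let session: Session = SessionBuilder::new()
        .known_node("127.0.0.1:9042")
        .cluster_metadata_refresh_interval(Duration::from_secs(1_000_000))
        .build()
        .await?;

    // ... run the benchmark workload against `session` here ...
    let _ = session;
    Ok(())
}
```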
@rukai Could you describe the setup of your benchmark in more detail?
I'm not able to reproduce the throughput dip around 60s. For my testing, I'm using cql-stress, which uses the Rust Driver:
./target/release/cql-stress-scylla-bench -mode write -workload sequential -nodes "127.0.0.1:9042" --partition-count 1000000
I tried it with both Scylla 5.4.0 (master) and Cassandra 3.11.15 and I don't see a throughput dip.
I tested my bencher again and observed that it doesn't happen locally, only when run on AWS. Maybe the smaller node size or the extra latency is the cause? I'll do some more investigation myself when I get the chance.
I have a 3-node Cassandra cluster running on three AWS m6a.large instances, and a bencher running on another AWS m6a.large instance.
You should be able to reproduce with:
git clone https://github.com/shotover/shotover-proxy
cd shotover-proxy
cargo windsock --cloud --name cassandra,compression=none,driver=scylla,operation=write_blob,protocol=v4,shotover=none,topology=cluster3 --bench-length-seconds 60
BIG WARNING THOUGH: this will create Amazon EC2 instances if you have AWS credentials set up on your machine.
It will attempt to clean up after itself but you should make sure that it succeeds, possibly running cargo windsock --cleanup-cloud-resources
to force a cleanup if it panics midway through.
If that sounds scary, fair enough; maybe just set up your own bench manually on cloud infrastructure and see if you can reproduce that way.
The throughput drop happens at about 42s into the bench since the driver is started before benching starts.
I am using scylla-rust-driver as the driver in a Cassandra benchmark. When the benchmark passes the 60s mark, throughput roughly halves for a few seconds, and I have tracked this down to the metadata refresh. I am able to eliminate the loss in throughput by changing the hardcoded refresh interval to a very large number: https://github.com/scylladb/scylla-rust-driver/blob/4efc84dfbc7bb204b49a8564378537e35cfe3ad1/scylla/src/transport/cluster.rs#L485
I would like to raise two issues as a result:
Issue 1
The metadata refresh should be made more performant. I did a quick investigation and found that it was `MetadataReader::read_metadata` that was impacting throughput. The atomic swapping of metadata results seems to be working fine, as removing the swap did not improve throughput in any way. I'm not sure if the cause is Cassandra slowing down, the client slowing down due to running the refresh queries, the client slowing down due to processing the results of those queries, a combination of these, or something else entirely. So I think it would be a good idea for the scylla-rust-driver team to give this a thorough investigation, as it seems like it would cause a dip in production performance every 60s. However, for the needs of my project, I think all I will need is what I describe in issue 2.
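To illustrate why I don't think the swap itself is the problem: publishing a new snapshot through an atomic pointer swap is cheap for readers. A minimal, self-contained sketch using the `arc_swap` crate (the `ClusterData` below is just a stand-in, not the driver's actual type):

```rust
use std::sync::Arc;
use arc_swap::ArcSwap;

// Stand-in for the driver's cluster metadata snapshot.
struct ClusterData {
    version: u64,
}

fn main() {
    let current = ArcSwap::from_pointee(ClusterData { version: 0 });

    // A reader (e.g. the request path picking a node) just loads the pointer;
    // it never blocks on the refresh task.
    let snapshot = current.load();
    println!("using cluster data v{}", snapshot.version);

    // The refresh task swaps in a freshly built snapshot atomically.
    current.store(Arc::new(ClusterData { version: 1 }));
    println!("now at v{}", current.load().version);
}
```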
Issue 2
We need a way to disable and/or change the timing of the metadata refresh. As I am writing a benchmark in which I can guarantee that no other client is altering the schema or topology, I would like a way to completely disable such background work so I can evaluate steady-state throughput alone.
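Purely as a hypothetical sketch of what I mean (not an existing API): an `Option<Duration>` where `None` means the refresh task never fires would cover my use case.

```rust
use std::time::Duration;

// Hypothetical: how the background task could honour `None` as "never refresh".
async fn metadata_refresh_loop(interval: Option<Duration>) {
    loop {
        match interval {
            Some(dur) => tokio::time::sleep(dur).await,
            // No interval configured: park this task forever.
            None => std::future::pending::<()>().await,
        }
        // ... perform the metadata refresh here ...
    }
}
```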