scylladb / scylla-rust-driver

Async CQL driver for Rust, optimized for ScyllaDB!

metadata refresh is really expensive! #786

Open rukai opened 1 year ago

rukai commented 1 year ago

I am using scylla-rust-driver as the driver in a Cassandra benchmark. When the benchmark passes the 60s mark, throughput roughly halves for a few seconds, and I have tracked this down to the metadata refresh. I am able to eliminate the loss in throughput by changing this to a very large number: https://github.com/scylladb/scylla-rust-driver/blob/4efc84dfbc7bb204b49a8564378537e35cfe3ad1/scylla/src/transport/cluster.rs#L485
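(For context, that line drives the driver's periodic background metadata refresh. The sketch below is purely illustrative of the shape of such a loop, not the driver's actual internals; it only shows why bumping the interval to a huge value makes the dip disappear.)

```rust
use std::time::Duration;
use tokio::time::sleep;

// Illustrative only; not the driver's actual code. A hardcoded interval drives a
// background task, so making the interval huge means the refresh essentially never runs.
async fn metadata_refresh_loop() {
    let refresh_interval = Duration::from_secs(60); // the hardcoded value in question
    loop {
        sleep(refresh_interval).await;
        // Here the driver re-reads topology/schema from the cluster and
        // atomically swaps the new metadata in for queries to use.
    }
}
```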

I would like to raise two issues as a result:

issue 1

The metadata refresh should be made more performant. A quick investigation showed that it was MetadataReader::read_metadata that was impacting throughput. The atomic swapping of the metadata results seems to be working fine, as removing the swap did not improve throughput in any way. I'm not sure whether the cause is Cassandra slowing down, the client slowing down because it is running the refresh queries, the client slowing down while processing the results of those queries, a combination of these, or something else entirely.
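For reference, the shape of the measurement loop I have in mind is roughly this (a simplified sketch, not my actual benchmark; the contact point and the `ks.t` table are placeholders that must already exist, and the `Session::query` call reflects the driver API at the time of writing):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Duration;

use scylla::{Session, SessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder contact point; ks.t below must already exist.
    let session: Arc<Session> = Arc::new(
        SessionBuilder::new()
            .known_node("127.0.0.1:9042")
            .build()
            .await?,
    );
    let counter = Arc::new(AtomicU64::new(0));

    // A handful of writer tasks hammering the cluster concurrently.
    for _ in 0..16 {
        let session = Arc::clone(&session);
        let counter = Arc::clone(&counter);
        tokio::spawn(async move {
            loop {
                if session
                    .query("INSERT INTO ks.t (pk, v) VALUES (1, 2)", ())
                    .await
                    .is_ok()
                {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            }
        });
    }

    // Print ops/s once per second; a dip that repeats every 60s points at the refresh.
    loop {
        tokio::time::sleep(Duration::from_secs(1)).await;
        println!("ops/s: {}", counter.swap(0, Ordering::Relaxed));
    }
}
```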

So I think it would be a good idea for the scylla-rust-driver team to investigate this thoroughly, as it looks like it would cause a dip in production performance every 60s. However, for the needs of my project, I think all I will need is what I describe in issue 2.

issue 2

We need a way to disable and/or change the timing of the metadata refresh. As I am writing a benchmark for which I can guarantee that no other client is altering the schema or topology, I would like a way to completely disable this background work so I can evaluate the average throughput alone.

rishabharyal commented 1 year ago

Hi @rukai, there is an active PR to manually set the topology refresh interval: https://github.com/scylladb/scylla-rust-driver/pull/776

However, other changes are not addressed in that pull request.

rukai commented 1 year ago

Oh, sorry for duplicating your PR, I'll go ahead and close mine.

It would make semantic sense to be able to set None in order to completely disable the refresh, but I can always just set the value to 1000000 for effectively the same result.
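For reference, the workaround I mean would look roughly like this, assuming #776 lands with a `SessionBuilder` method along the lines of `cluster_metadata_refresh_interval` (the final name is up to that PR):

```rust
use std::time::Duration;

use scylla::transport::errors::NewSessionError;
use scylla::{Session, SessionBuilder};

// Sketch only: the builder method name is an assumption about what #776 adds.
// A huge interval effectively disables the background metadata refresh.
async fn build_session() -> Result<Session, NewSessionError> {
    SessionBuilder::new()
        .known_node("127.0.0.1:9042") // placeholder contact point
        .cluster_metadata_refresh_interval(Duration::from_secs(1_000_000))
        .build()
        .await
}
```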

avelanarius commented 1 year ago

@rukai Could you describe the setup of your benchmark in more detail?

I'm not able to reproduce the throughput dip around 60s. For my testing I'm using cql-stress, which uses the Rust driver: `./target/release/cql-stress-scylla-bench -mode write -workload sequential -nodes "127.0.0.1:9042" --partition-count 1000000`. I tried it with both Scylla 5.4.0 (master) and Cassandra 3.11.15 and I don't see a throughput dip:

Logs:

```
$ ./target/release/cql-stress-scylla-bench -mode write -workload sequential -nodes "127.0.0.1:9042" --partition-count 1000000
Configuration
Mode: write
Workload: sequential
Timeout: 5.0s
Consistency level: quorum
Partition count: 1000000
Clustering rows: 100
Clustering row size: Fixed(4)
Rows per request: 1
Page size: 1000
Concurrency: 16
Maximum rate: unlimited
Client compression: true

time ops/s rows/s errors max 99.9th 99th 95th 90th median mean
1.0s 62790 62790 0 1.60ms 716μs 442μs 354μs 326μs 248μs 254μs
2.0s 70961 70961 0 36.1ms 549μs 322μs 290μs 274μs 214μs 225μs
3.0s 70589 70589 0 30.9ms 424μs 331μs 295μs 278μs 217μs 226μs
4.0s 70173 70173 0 22.7ms 414μs 331μs 297μs 280μs 220μs 227μs
5.0s 71800 71800 0 23.7ms 934μs 396μs 291μs 272μs 212μs 222μs
6.0s 74249 74249 0 1.95ms 412μs 317μs 287μs 271μs 212μs 215μs
7.0s 71489 71489 0 29.4ms 351μs 316μs 288μs 273μs 215μs 223μs
8.0s 75378 75378 0 718μs 349μs 313μs 284μs 268μs 209μs 211μs
9.0s 71557 71557 0 36.0ms 820μs 337μs 287μs 271μs 211μs 223μs
10.0s 72476 72476 0 35.0ms 388μs 315μs 284μs 269μs 210μs 220μs
11.0s 74803 74803 0 1.46ms 358μs 313μs 284μs 269μs 211μs 213μs
12.0s 72413 72413 0 34.3ms 369μs 319μs 287μs 270μs 210μs 220μs
13.0s 74634 74634 0 1.24ms 359μs 314μs 285μs 270μs 212μs 213μs
14.0s 71706 71706 0 35.6ms 370μs 316μs 287μs 272μs 213μs 222μs
15.0s 71063 71063 0 34.0ms 755μs 346μs 290μs 273μs 213μs 224μs
16.0s 73064 73064 0 1.45ms 397μs 323μs 292μs 276μs 216μs 218μs
17.0s 72503 72503 0 33.4ms 368μs 316μs 285μs 270μs 210μs 220μs
18.0s 73485 73485 0 1.34ms 352μs 316μs 287μs 272μs 213μs 214μs
19.0s 71586 71586 0 34.8ms 411μs 332μs 294μs 276μs 215μs 225μs
20.0s 70933 70933 0 34.2ms 547μs 327μs 290μs 274μs 214μs 225μs
21.0s 72102 72102 0 515μs 375μs 325μs 296μs 281μs 219μs 221μs
22.0s 70153 70153 0 32.6ms 880μs 338μs 295μs 278μs 216μs 227μs
23.0s 72741 72741 0 1.35ms 352μs 311μs 284μs 270μs 211μs 213μs
24.0s 71337 71337 0 34.5ms 403μs 329μs 297μs 281μs 220μs 230μs
25.0s 71712 71712 0 35.0ms 378μs 317μs 288μs 272μs 212μs 222μs
26.0s 73984 73984 0 1.85ms 463μs 323μs 290μs 273μs 213μs 215μs
27.0s 72069 72069 0 35.8ms 366μs 316μs 285μs 270μs 212μs 221μs
28.0s 72660 72660 0 34.0ms 354μs 312μs 283μs 268μs 210μs 219μs
29.0s 72374 72374 0 1.59ms 870μs 344μs 292μs 276μs 216μs 220μs
30.0s 72031 72031 0 33.9ms 375μs 319μs 288μs 271μs 212μs 221μs
31.0s 74030 74030 0 1.84ms 488μs 326μs 289μs 273μs 212μs 215μs
32.0s 72400 72400 0 33.3ms 369μs 315μs 285μs 270μs 210μs 220μs
33.0s 71191 71191 0 34.8ms 382μs 322μs 290μs 274μs 214μs 224μs
34.0s 73433 73433 0 573μs 349μs 316μs 289μs 274μs 215μs 217μs
35.0s 70170 70170 0 37.4ms 781μs 347μs 293μs 276μs 215μs 227μs
36.0s 73203 73203 0 1.36ms 356μs 318μs 290μs 275μs 216μs 218μs
37.0s 71726 71726 0 34.5ms 440μs 319μs 286μs 271μs 213μs 222μs
38.0s 70500 70500 0 34.3ms 373μs 324μs 292μs 276μs 217μs 226μs
39.0s 74087 74087 0 1.43ms 367μs 319μs 288μs 272μs 213μs 215μs
40.0s 68335 68335 0 35.3ms 403μs 338μs 302μs 285μs 223μs 233μs
41.0s 68565 68565 0 27.1ms 412μs 336μs 302μs 286μs 225μs 232μs
42.0s 70874 70874 0 22.8ms 1.15ms 482μs 294μs 275μs 212μs 225μs
43.0s 71805 71805 0 28.0ms 471μs 322μs 289μs 273μs 214μs 222μs
44.0s 73421 73421 0 1.39ms 384μs 328μs 293μs 276μs 215μs 217μs
45.0s 72422 72422 0 33.7ms 361μs 314μs 285μs 269μs 211μs 220μs
46.0s 74150 74150 0 3.16ms 384μs 315μs 287μs 272μs 212μs 215μs
47.0s 72288 72288 0 35.0ms 368μs 313μs 284μs 269μs 211μs 220μs
48.0s 71578 71578 0 33.5ms 361μs 317μs 287μs 272μs 214μs 223μs
49.0s 73275 73275 0 2.47ms 880μs 333μs 290μs 273μs 213μs 217μs
50.0s 71695 71695 0 33.4ms 360μs 314μs 286μs 271μs 213μs 222μs
51.0s 74182 74182 0 1.34ms 356μs 316μs 288μs 272μs 213μs 215μs
52.0s 71606 71606 0 34.6ms 369μs 319μs 289μs 273μs 213μs 223μs
53.0s 72331 72331 0 36.0ms 360μs 313μs 284μs 268μs 211μs 220μs
54.0s 73054 73054 0 679μs 405μs 322μs 291μs 276μs 216μs 218μs
55.0s 70644 70644 0 34.3ms 686μs 346μs 292μs 275μs 213μs 226μs
56.0s 71554 71554 0 33.2ms 674μs 319μs 289μs 273μs 213μs 223μs
57.0s 74404 74404 0 1.30ms 351μs 314μs 286μs 271μs 212μs 214μs
58.0s 71700 71700 0 33.1ms 375μs 318μs 288μs 273μs 213μs 222μs
59.0s 72182 72182 0 1.68ms 384μs 326μs 296μs 280μs 219μs 221μs
1m0.0s 71219 71219 0 32.2ms 433μs 323μs 290μs 274μs 214μs 224μs
1m1.0s 71140 71140 0 33.2ms 379μs 319μs 289μs 274μs 215μs 224μs
1m2.0s 72428 72428 0 5.56ms 886μs 342μs 294μs 277μs 216μs 220μs
1m3.0s 71487 71487 0 32.4ms 373μs 322μs 289μs 273μs 214μs 223μs
1m4.0s 72515 72515 0 1.54ms 368μs 323μs 294μs 278μs 218μs 220μs
1m5.0s 71244 71244 0 32.8ms 388μs 327μs 292μs 276μs 214μs 224μs
1m6.0s 71459 71459 0 35.3ms 388μs 319μs 289μs 273μs 213μs 223μs
1m7.0s 73278 73278 0 1.30ms 380μs 322μs 292μs 276μs 215μs 217μs
1m8.0s 71029 71029 0 33.5ms 761μs 338μs 291μs 274μs 213μs 224μs
```

rukai commented 1 year ago

I tested my bencher again and observed that it doesn't happen locally, only when run on AWS. Maybe the smaller node size or the extra latency is the cause? I'll do some more investigation myself when I get the chance.

I have a 3-node Cassandra cluster running on 3 AWS m6a.large instances, and the bencher runs on another AWS m6a.large instance.

You should be able to reproduce with:

```
git clone https://github.com/shotover/shotover-proxy
cd shotover-proxy
cargo windsock --cloud --name cassandra,compression=none,driver=scylla,operation=write_blob,protocol=v4,shotover=none,topology=cluster3 --bench-length-seconds 60
```

BIG WARNING THOUGH: this will create Amazon EC2 instances if you have AWS credentials set up on your machine. It will attempt to clean up after itself, but you should make sure that it succeeds, possibly running `cargo windsock --cleanup-cloud-resources` to force a cleanup if it panics midway through. If that sounds scary, fair enough; maybe just set up your own bench manually on cloud infrastructure and see if you can reproduce that way.

The throughput drop happens at about 42s into the bench, since the driver is started before benchmarking begins.