scylladb / scylla-bench


Scylla-bench stops reporting latency stats when latency is high #50

Closed juliayakovlev closed 3 years ago

juliayakovlev commented 3 years ago

Installation details
Scylla version (or git commit hash): 4.4.rc1-0.20210223.9fc582ee8 with build-id a6ce2528451d7c29e1555c15960085dab0751b78
Cluster size: 4 nodes (i3en.3xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0b8e9fcc7bfa8fab3 (aws: eu-north-1)

Test: longevity-large-partition-4days-test
Test name: longevity_large_partition_test.LargePartitionLongevityTest.test_large_partition_longevity
Test config file(s):

Test runs scylla-bench commands:

"scylla-bench -workload=sequential -mode=write -max-rate=300 -replication-factor=3 -partition-count=10 -clustering-row-count=10000000 -clustering-row-size=5120 -concurrency=7 -rows-per-request=10"

"scylla-bench -workload=uniform -mode=read -replication-factor=3 -partition-count=10 -clustering-row-count=10000000 -clustering-row-size=5120 -rows-per-request=10 -concurrency=7 -max-rate=32000 -duration=5760m"

When the scylla-bench load starts, latency is reported for the first 22 seconds, but from the 23rd second on it reports zeros:

< t:2021-02-25 18:34:28,860 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG >  time   ops/s  rows/s errors max    99.9th 99th   95th   90th   median mean
< t:2021-02-25 18:34:29,861 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG >    1s    8820   88200      0 727ms  725ms  718ms  688ms  652ms  365ms  366ms  
< t:2021-02-25 18:34:30,862 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG >    2s    8544   85440      0 1.5s   1.5s   1.5s   1.4s   1.4s   1.1s   1.1s  
< t:2021-02-25 18:34:31,862 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG >    3s    8286   82860      0 2.2s   2.2s   2.2s   2.2s   2.1s   1.8s   1.8s  
< t:2021-02-25 18:34:32,862 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG >    4s    7984   79840      0 3s     3s     2.9s   2.9s   2.9s   2.6s   2.6s  
< t:2021-02-25 18:34:33,864 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG >    5s    7603   76030      0 3.7s   3.7s   3.7s   3.7s   3.6s   3.3s   3.3s   
.......................
< t:2021-02-25 18:34:50,871 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG >   22s    6305   63050      0 17.2s  17.2s  17.2s  17.1s  17.1s  16.8s  16.8s  
< t:2021-02-25 18:34:51,873 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG >   23s    5978   59780      0 33µs   0s     0s     0s     0s     0s     0s    
< t:2021-02-25 18:34:52,874 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG >   24s    6499   64990      0 33µs   0s     0s     0s     0s     0s     0s    
< t:2021-02-25 18:34:53,875 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG >   25s    6550   65500      0 33µs   0s     0s     0s     0s     0s     0s    
< t:2021-02-25 18:34:54,876 f:base.py         l:220  c:RemoteCmdRunner      p:DEBUG >   26s    6483   64830      0 33µs   0s     0s     0s     0s     0s     0s     

Latency increased up to 17s and then stopped being reported.

Restore Monitor Stack command: $ hydra investigate show-monitor 292de59d-840c-493c-a164-263a3ea59ea3
Show all stored logs command: $ hydra investigate show-logs 292de59d-840c-493c-a164-263a3ea59ea3

Test id: 292de59d-840c-493c-a164-263a3ea59ea3

Logs:
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/292de59d-840c-493c-a164-263a3ea59ea3/20210301_041635/loader-set-292de59d.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/292de59d-840c-493c-a164-263a3ea59ea3/20210301_041635/sct-runner-292de59d.zip

Jenkins job URL

sitano commented 3 years ago

The reason is probably that HdrHistogram can't register such high latency values and the error is not checked.
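
For illustration, a minimal sketch of that failure mode, assuming the codahale/hdrhistogram package; the histogram bounds below are made up for the example and this is not the actual scylla-bench code:

```go
package main

import (
	"fmt"
	"time"

	"github.com/codahale/hdrhistogram"
)

func main() {
	// Histogram tracking latencies between 1µs and 10s at 3 significant figures
	// (illustrative bounds, not the ones scylla-bench uses).
	h := hdrhistogram.New(time.Microsecond.Nanoseconds(), (10 * time.Second).Nanoseconds(), 3)

	// A 17s latency is above the highest trackable value, so RecordValue fails.
	if err := h.RecordValue((17 * time.Second).Nanoseconds()); err != nil {
		// If this error is dropped, the sample is silently lost and the
		// reported percentiles no longer reflect the real latency.
		fmt.Println("sample not recorded:", err)
	}
}
```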

juliayakovlev commented 3 years ago

@sitano judging by the numbers, it looks like each second's latency is summed with the previous value. What do you think? Could that be what happened?

sitano commented 3 years ago

That is another concern. As far as I know, for per-second reporting a tool should use its own fresh instance of the histogram every cycle, and the final summary must be an aggregate of those intervals; that's why you would not see this with YCSB or cassandra-stress (C-S) per-second reporting. Anyway, this growing latency is the Coordinated Omission (CO) correction.
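
A minimal sketch of that pattern, again assuming the codahale/hdrhistogram package; the bounds and the latencies channel are made up for the example, and this is not the actual scylla-bench reporting code:

```go
package report

import (
	"fmt"
	"log"
	"time"

	"github.com/codahale/hdrhistogram"
)

// reportLoop records latency samples (in nanoseconds) into a fresh per-interval
// histogram, prints that interval every second, merges it into a cumulative
// histogram for the final summary, and resets it for the next cycle.
func reportLoop(latencies <-chan int64, done <-chan struct{}) {
	interval := hdrhistogram.New(1, (100 * time.Second).Nanoseconds(), 3)
	total := hdrhistogram.New(1, (100 * time.Second).Nanoseconds(), 3)

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		select {
		case lat := <-latencies:
			if err := interval.RecordValue(lat); err != nil {
				log.Printf("latency %v not recorded: %v", time.Duration(lat), err)
			}
		case <-ticker.C:
			total.Merge(interval) // the final summary aggregates all intervals
			fmt.Printf("p99=%v max=%v\n",
				time.Duration(interval.ValueAtQuantile(99)),
				time.Duration(interval.Max()))
			interval.Reset() // fresh histogram for the next reporting cycle
		case <-done:
			fmt.Printf("overall p99=%v\n", time.Duration(total.ValueAtQuantile(99)))
			return
		}
	}
}
```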

dkropachev commented 3 years ago

This is regression caused by https://github.com/scylladb/scylla-bench/commit/f169752fc8b044daab45a9f4a93565e1bc5b8e2b

dkropachev commented 3 years ago

Root causes of the issue are:

  1. We ignore errors when storing latencies (modes.go: RecordLatency).
  2. MaximumRateLimiter.Expected is implemented so that it accumulates delays over time.

As a result, latency increases over time, exceeds the limit of hdrhistogram.Histogram, and silently stops being reported (see the sketch below for point 2).
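
A simplified sketch of how an "expected time" limiter interacts with CO-corrected measurement; this is illustrative only, not the actual MaximumRateLimiter code. Latency is measured from the scheduled start of each operation, so when the worker cannot sustain the configured rate, the schedule falls further behind the wall clock on every call and the measured latency grows without bound:

```go
package bench

import "time"

// expectedTimeLimiter is a simplified stand-in for a rate limiter that tracks
// the scheduled ("expected") start time of the next operation.
type expectedTimeLimiter struct {
	period   time.Duration // 1 / configured rate
	expected time.Time     // scheduled start of the next operation
}

// wait throttles if we are ahead of schedule and returns the scheduled start
// time, which a Coordinated-Omission-corrected measurement treats as the
// beginning of the operation.
func (l *expectedTimeLimiter) wait() time.Time {
	if now := time.Now(); l.expected.After(now) {
		time.Sleep(l.expected.Sub(now))
	}
	start := l.expected
	l.expected = l.expected.Add(l.period) // schedule always advances by a fixed period
	return start
}

// If every operation actually takes longer than period, time.Now() pulls ahead
// of l.expected on each call, so time.Since(start) grows by roughly
// (actual duration - period) per operation, with no upper bound, until it
// exceeds the histogram's trackable range.
```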

sitano commented 3 years ago

Root causes of the issue are:

  1. We ignore errors when storing latencies (modes.go: RecordLatency).
  2. MaximumRateLimiter.Expected is implemented so that it accumulates delays over time.

As a result, latency increases over time, exceeds the limit of hdrhistogram.Histogram, and silently stops being reported.

I don't think those are the root cause. If my understanding is correct that the problem is that interpolation causes the latency to grow indefinitely when the program can't keep up with the picked rate, then the problem is the configuration, not the Coordinated Omission fix. Deleting the interpolation does not solve the issue; it only hides the problem.
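
To put rough numbers on "can't keep up" (an estimate, assuming -max-rate counts operations per second and that the pasted log comes from the read command): the target is 32000 ops/s but only about 6000-9000 ops/s are achieved, so every wall-clock second roughly (32000 - 6400) / 32000 ≈ 0.8 s of scheduled work is added to the backlog, and the CO-corrected latency of the newest operation grows by about 0.8 s per second. That is consistent with the ~17 s reported at the 22 s mark.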

dkropachev commented 3 years ago

fixed by https://github.com/scylladb/scylla-bench/pull/52