Closed (juliayakovlev closed 3 years ago)
Probably the reason is that HdrHistogram cannot record latency values that high, and the returned error is not checked.
@sitano, judging by the numbers, it looks like the latency is summed with the previous value every second. What do you think? Could that be happening?
That is a separate concern. As far as I know, for per-second reporting a tool should use its own fresh histogram instance every cycle, and the final result must be an aggregate. That is why you would not see this with YCSB or C-S with per-second reporting. Anyway, this growing latency is the coordinated omission (CO) correction.
This is a regression caused by https://github.com/scylladb/scylla-bench/commit/f169752fc8b044daab45a9f4a93565e1bc5b8e2b
Root causes of the issue are:
- We ignore errors when storing latencies (modes.go: RecordLatency)
- MaximumRateLimiter.Expected accumulates delays over time
As a result, latency increases over time, exceeds the limit of hdrhistogram.Histogram, and silently stops being reported.
I don't think those are the root causes. If my understanding is correct, the problem is that interpolation causes the latency to grow indefinitely when the program can't keep up with the chosen rate. In that case the problem is the configuration, not the Coordinated Omission fix. Deleting interpolation does not solve the issue; it only hides the problem.
Installation details
Scylla version (or git commit hash): 4.4.rc1-0.20210223.9fc582ee8 with build-id a6ce2528451d7c29e1555c15960085dab0751b78
Cluster size: 4 nodes (i3en.3xlarge)
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0b8e9fcc7bfa8fab3 (aws: eu-north-1)
Test: longevity-large-partition-4days-test
Test name: longevity_large_partition_test.LargePartitionLongevityTest.test_large_partition_longevity
Test config file(s):
Test runs scylla-bench commands:
When the scylla-bench load starts, latency is reported for the first 22 seconds. From second 23 onward it reports 0ms:
Latency increased up to 17s and then stopped being reported.
Restore Monitor Stack command:
$ hydra investigate show-monitor 292de59d-840c-493c-a164-263a3ea59ea3
Show all stored logs command:
$ hydra investigate show-logs 292de59d-840c-493c-a164-263a3ea59ea3
Test id:
292de59d-840c-493c-a164-263a3ea59ea3
Logs:
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/292de59d-840c-493c-a164-263a3ea59ea3/20210301_041635/loader-set-292de59d.zip
sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/292de59d-840c-493c-a164-263a3ea59ea3/20210301_041635/sct-runner-292de59d.zip
Jenkins job URL