Closed yarongilor closed 1 year ago
Not sure what next step for this investigation should be. Cc: @roydahan , @vponomaryov , @fruch
First, open the loader log and see what the actual failure is that stops scylla-bench.
Second, edit the summary with something more useful that points out that this is really a scylla-bench issue.
Well, if there had been an actual failure, it would have been clearer, but there isn't one. The last line in the log is just a normal read stat:
10h27m14.8s 4407 8632122 0 14.5s 2.8s 2.6s 1.3s 180ms 44ms 185ms
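To check lines like this one programmatically (for example, to confirm that the error count is zero right up to the point where the stress stops), a small parser can help. The following is a minimal sketch, assuming the columns follow the order time, ops/s, rows/s, errors, max, 99.9th, 99th, 95th, 90th, median, mean; that order is an assumption not stated in this thread, and the function names are hypothetical.

```python
import re

# Assumed column order of a scylla-bench stat line (not confirmed in this thread).
COLUMNS = ["time", "ops_per_s", "rows_per_s", "errors",
           "max", "p99_9", "p99", "p95", "p90", "median", "mean"]

# Multi-character units come first so "ms" is not read as "m" followed by "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|us|µs|ns|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                 "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}


def parse_duration(text):
    """Convert a duration such as '10h27m14.8s' or '180ms' to seconds."""
    pos = 0
    total = 0.0
    for match in _PART.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unexpected characters in duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"cannot parse duration: {text!r}")
    return total


def parse_stat_line(line):
    """Split one stat line into a dict keyed by the assumed column names."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    for name in ("ops_per_s", "rows_per_s", "errors"):
        row[name] = int(row[name])
    for name in ("time", "max", "p99_9", "p99", "p95", "p90", "median", "mean"):
        row[name] = parse_duration(row[name])
    return row


if __name__ == "__main__":
    last = parse_stat_line(
        "10h27m14.8s 4407 8632122 0 14.5s 2.8s 2.6s 1.3s 180ms 44ms 185ms")
    print(last["errors"], last["max"])
```

If the assumed column order holds, the last line reports zero errors, which matches the observation that scylla-bench logged no failure before it stopped.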
Second, edit the summary with something more useful that points this is really scylla-bench issue.
This is the most useful summary found so far for this case: the test runs OK without reversed queries for 2 days. Then, after switching to reversed queries, it fails after 10 hours, without giving any reason why the stress stopped.
What about the loader memory? If I recall correctly, you enlarged the instances because of that.
Right, @fruch, that looks like a good guess. So it appears that, when using the reversed-query parameter in s-b, the memory leak is more than twice as fast/large as with non-reversed queries, and it may indeed be a dup of https://github.com/scylladb/scylla-bench/issues/112, only with a different symptom and parameter.
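One way to put a number on "twice as fast" is to fit a line through memory samples taken on the loader during each run and compare the slopes. The following is a minimal sketch, assuming samples are collected as (seconds since start, resident memory in bytes) pairs; the function name and the sample values are hypothetical and are not measurements from this test.

```python
def growth_rate(samples):
    """Least-squares slope of memory over time, in bytes per second.

    `samples` is a list of (seconds_since_start, rss_bytes) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples have the same timestamp")
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    return cov / var_t


if __name__ == "__main__":
    # Made-up illustrative numbers, NOT data from this run.
    forward = [(0, 1_000_000), (3600, 1_400_000), (7200, 1_800_000)]
    reverse = [(0, 1_000_000), (3600, 2_000_000), (7200, 3_000_000)]
    ratio = growth_rate(reverse) / growth_rate(forward)
    print(f"reversed run grows {ratio:.1f}x faster")
```

A slope comparison like this only describes how fast memory grows; it says nothing about where the leak comes from.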
Well, it's almost the same query as in #112, so I would expect it to leak almost the same. But again, we don't know the source of the leak, so assuming anything about its behavior is pure guessing.
Installation details
Kernel Version: 5.15.0-1026-aws
Scylla version (or git commit hash): 2022.2.0~rc6-20221130.81c9ffb0b459 with build-id a4847c84172de3dbbdbfaf92acf970eb6de1ca28
Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc6.0.20221130.81c9ffb0b459.tar.gz
Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0fa4f026d1ce15529 (aws: eu-west-1)
Test: longevity-large-partition-4days-test-rq2
Test id: 3c6d345a-8865-4609-8d1a-502119c16eb9
Test name: scylla-staging/Longevity_yaron/longevity-large-partition-4days-test-rq2
Test config file(s):
Issue description
>>>>>>> Scenario: running a single read stress with reversed queries (-select-order-by=desc) and 10 retries.
Got multiple timeout errors within an interval of 2 hours:
The throughput and load were quite stable throughout the test. Then, after another 2.5 hours, the stress failed.
There were (unexpected) drops at 21.22 and 21.24. Then the stress failed completely at 21.26.
<<<<<<<
$ hydra investigate show-monitor 3c6d345a-8865-4609-8d1a-502119c16eb9
$ hydra investigate show-logs 3c6d345a-8865-4609-8d1a-502119c16eb9
Logs:
Log links for testrun with test id 3c6d345a-8865-4609-8d1a-502119c16eb9
+-----------------+-------------+------+
| Date            | Log type    | Link |
+-----------------+-------------+------+
| 20221214_113932 | grafana     | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_113932/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221214_114041-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png |
| 20221214_113932 | grafana     | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_113932/grafana-screenshot-overview-20221214_113933-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png |
| 20221214_202738 | grafana     | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_202738/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221214_202818-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png |
| 20221214_202738 | grafana     | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_202738/grafana-screenshot-overview-20221214_202739-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png |
| 20221214_203552 | grafana     | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_203552/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221214_203629-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png |
| 20221214_203552 | grafana     | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_203552/grafana-screenshot-overview-20221214_203552-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png |
| 20221214_213442 | grafana     | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_213442/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221214_213525-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png |
| 20221214_213442 | grafana     | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_213442/grafana-screenshot-overview-20221214_213443-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png |
| 20221214_214637 | db-cluster  | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_214637/db-cluster-3c6d345a.tar.gz |
| 20221214_214637 | loader-set  | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_214637/loader-set-3c6d345a.tar.gz |
| 20221214_214637 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_214637/monitor-set-3c6d345a.tar.gz |
| 20221214_214637 | sct         | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_214637/sct-runner-3c6d345a.tar.gz |
+-----------------+-------------+------+
Jenkins job URL