scylladb / scylla-bench

43 stars 36 forks source link

A read stress using reversed-queries failed after 19 hours running with nemesis #115

Closed yarongilor closed 1 year ago

yarongilor commented 1 year ago

Installation details

Kernel Version: 5.15.0-1026-aws Scylla version (or git commit hash): 2022.2.0~rc6-20221130.81c9ffb0b459 with build-id a4847c84172de3dbbdbfaf92acf970eb6de1ca28 Relocatable Package: http://downloads.scylladb.com/downloads/scylla-enterprise/relocatable/scylladb-2022.2/scylla-enterprise-x86_64-package-2022.2.0-rc6.0.20221130.81c9ffb0b459.tar.gz

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0fa4f026d1ce15529 (aws: eu-west-1)

Test: longevity-large-partition-4days-test-rq2 Test id: 3c6d345a-8865-4609-8d1a-502119c16eb9 Test name: scylla-staging/Longevity_yaron/longevity-large-partition-4days-test-rq2 Test config file(s):

Issue description

>>>>>>> scenario: running a single read stress with reversed queries (-select-order-by=desc) and 10 retries.

Got multiple timeout errors in an interval of 2 hours:

yarongilor@yarongilor:~/Downloads/logs/sct-runner-3c6d345a$ grep f:file_logger.py sct.log | grep 'Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM' -c
165
yarongilor@yarongilor:~/Downloads/logs/sct-runner-3c6d345a$ grep f:file_logger.py sct.log | grep 'Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM' | head -n 1
< t:2022-12-14 16:36:40,277 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > 2022/12/14 16:36:26 Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.
yarongilor@yarongilor:~/Downloads/logs/sct-runner-3c6d345a$ grep f:file_logger.py sct.log | grep 'Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM' | tail -n 1
< t:2022-12-14 18:50:27,890 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > 2022/12/14 18:49:12 Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.

The throughput and load were quite stable along test. Then, after another 2.5 hours, the stress failed.

There were (unexpected) drops at 21.22 and 21.24. Then stress completely failed at 21.26. image

< t:2022-12-14 10:58:53,442 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-12-14 10:58:53.440: (ScyllaBenchEvent Severity.NORMAL) period_type=begin event_id=c6613a6d-ed38-4cac-8539-30a88cd73a61: node=Node longevity-large-partitions-2d-order-loader-node-3c6d345a-1 [34.245.108.140 | 10.4.2.196] (seed: False)
< t:2022-12-14 10:58:53,442 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=scylla-bench -workload=uniform -mode=read -select-order-by=desc -replication-factor=3 -partition-count=800 -clustering-row-count=50000 -clustering-row-size=uniform:100..5120 -rows-per-request=2000 -timeout=240s -retry-number=10 -concurrency=400 -duration=44h -connection-count 400 -nodes 10.4.1.3,10.4.2.27,10.4.2.76,10.4.0.56
< t:2022-12-14 10:58:53,603 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Configuration
< t:2022-12-14 10:58:53,603 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Mode:                    read
< t:2022-12-14 10:58:53,603 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Workload:                uniform
< t:2022-12-14 10:58:53,603 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Timeout:                 4m0s
< t:2022-12-14 10:58:53,603 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Max error number at row: unlimited
< t:2022-12-14 10:58:53,604 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Max error number:        unlimited
< t:2022-12-14 10:58:53,604 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > Retries:                
< t:2022-12-14 10:58:53,604 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   number:                10
< t:2022-12-14 10:58:53,604 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   min interval:          80ms
< t:2022-12-14 10:58:53,604 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG >   max interval:          1s
--
< t:2022-12-14 21:26:09,468 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > 10h27m14.8s    4407 8632122      0 14.5s  2.8s   2.6s   1.3s   180ms  44ms   185ms  
< t:2022-12-14 21:26:11,474 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 

< t:2022-12-14 21:26:11,476 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-12-14 21:26:11.474: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=c6613a6d-ed38-4cac-8539-30a88cd73a61 duration=10h27m18s: node=Node longevity-large-partitions-2d-order-loader-node-3c6d345a-1 [34.245.108.140 | 10.4.2.196] (seed: False)
< t:2022-12-14 21:26:11,476 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=scylla-bench -workload=uniform -mode=read -select-order-by=desc -replication-factor=3 -partition-count=800 -clustering-row-count=50000 -clustering-row-size=uniform:100..5120 -rows-per-request=2000 -timeout=240s -retry-number=10 -concurrency=400 -duration=44h -connection-count 400 -nodes 10.4.1.3,10.4.2.27,10.4.2.76,10.4.0.56
< t:2022-12-14 21:26:11,476 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > errors:
< t:2022-12-14 21:26:11,476 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 
< t:2022-12-14 21:26:11,476 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stress command execution failed with:

<<<<<<<

Logs:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Log links for testrun with test id 3c6d345a-8865-4609-8d1a-502119c16eb9 | +-----------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Date | Log type | Link | +-----------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 20221214_113932 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_113932/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221214_114041-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png | | 20221214_113932 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_113932/grafana-screenshot-overview-20221214_113933-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png | | 20221214_202738 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_202738/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221214_202818-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png | | 20221214_202738 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_202738/grafana-screenshot-overview-20221214_202739-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png | | 20221214_203552 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_203552/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221214_203629-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png | | 20221214_203552 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_203552/grafana-screenshot-overview-20221214_203552-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png | | 20221214_213442 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_213442/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221214_213525-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png | | 20221214_213442 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_213442/grafana-screenshot-overview-20221214_213443-longevity-large-partitions-2d-order-monitor-node-3c6d345a-1.png | | 20221214_214637 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_214637/db-cluster-3c6d345a.tar.gz | | 20221214_214637 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_214637/loader-set-3c6d345a.tar.gz | | 20221214_214637 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_214637/monitor-set-3c6d345a.tar.gz | | 20221214_214637 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/3c6d345a-8865-4609-8d1a-502119c16eb9/20221214_214637/sct-runner-3c6d345a.tar.gz | +-----------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Jenkins job URL

yarongilor commented 1 year ago

Not sure what next step for this investigation should be. Cc: @roydahan , @vponomaryov , @fruch

fruch commented 1 year ago

Not sure what next step for this investigation should be. Cc: @roydahan , @vponomaryov , @fruch

First opening the loader log, seeing what is the actual failure that stops scylla-bench ?

roydahan commented 1 year ago

Second, edit the summary with something more useful that points this is really scylla-bench issue.

yarongilor commented 1 year ago

Not sure what next step for this investigation should be. Cc: @roydahan , @vponomaryov , @fruch

First opening the loader log, seeing what is the actual failure that stops scylla-bench ?

well, if there was any actual failure, it would have been clearer, but there isn't. The last line in log is just a normal read stat:

10h27m14.8s    4407 8632122      0 14.5s  2.8s   2.6s   1.3s   180ms  44ms   185ms
yarongilor commented 1 year ago

Second, edit the summary with something more useful that points this is really scylla-bench issue.

This is the most useful summary currently found in this case - the test does run ok without reversed queries for 2 days. Then when switching to reversed-queries, it fails after 10 hours, not specifying any reason why stress stopped.

fruch commented 1 year ago

Not sure what next step for this investigation should be. Cc: @roydahan , @vponomaryov , @fruch

First opening the loader log, seeing what is the actual failure that stops scylla-bench ?

well, if there was any actual failure, it would have been clearer, but there isn't. The last line in log is just a normal read stat:

10h27m14.8s    4407 8632122      0 14.5s  2.8s   2.6s   1.3s   180ms  44ms   185ms

what about the loader memory ? if I recall correctly you enlarged the instances cause of that

yarongilor commented 1 year ago

Not sure what next step for this investigation should be. Cc: @roydahan , @vponomaryov , @fruch

First opening the loader log, seeing what is the actual failure that stops scylla-bench ?

well, if there was any actual failure, it would have been clearer, but there isn't. The last line in log is just a normal read stat:

10h27m14.8s    4407 8632122      0 14.5s  2.8s   2.6s   1.3s   180ms  44ms   185ms

what about the loader memory ? if I recall correctly you enlarged the instances cause of that

right, @fruch , that looks like a good guess. So it appears, where using reversed-query parameter in s-b, The memory leak is more than twice faster/larger (than non-reversed) and it may indeed be a dup of https://github.com/scylladb/scylla-bench/issues/112 , only with a different symptom and parameter.

image

fruch commented 1 year ago

Not sure what next step for this investigation should be. Cc: @roydahan , @vponomaryov , @fruch

First opening the loader log, seeing what is the actual failure that stops scylla-bench ?

well, if there was any actual failure, it would have been clearer, but there isn't. The last line in log is just a normal read stat:

10h27m14.8s    4407 8632122      0 14.5s  2.8s   2.6s   1.3s   180ms  44ms   185ms

what about the loader memory ? if I recall correctly you enlarged the instances cause of that

right, @fruch , that looks like a good guess. So it appears, where using reversed-query parameter in s-b, The memory leak is more than twice faster/larger (than non-reversed) and it may indeed be a dup of #112 , only with a different symptom and parameter.

image

well it's almost identical query as in #112, I would expect it to leak almost the same, but again we don't know the source of the leak, so assumption anything on it's behavior is purely guessing.

yarongilor commented 1 year ago

dup of https://github.com/scylladb/scylla-bench/issues/112