Open yarongilor opened 1 year ago
loader-4 ran the following scylla-bench (s-b) read stress for about 12.5 hours, out of the expected 24 hours.
Its log file is: scylla-bench-l0-bdfd47ae-45c4-4348-a835-ba48fe0397b9.log
SCT log shows:
2023_07_29__03_05_15_387.sct-5cd89f45.log:< t:2023-07-29 07:09:25,838 f:scylla_bench_thread.py l:195 c:sdcm.scylla_bench_thread p:DEBUG > Scylla bench command: scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=1000 -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -rows-per-request=10 -consistency-level=quorum -timeout=90s -partition-offset=1001 -concurrency=100 -connection-count=100 -iterations=0 -duration=1440m -error-at-row-limit 1000

2023_07_29__03_05_15_387.sct-5cd89f45.log:< t:2023-07-29 07:09:25,840 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo docker exec e9a633737a5dc44efd1e27efe37cceba012565f98f333d7a4828a9ae42f4b1e7 /bin/sh -c 'scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=1000 -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -rows-per-request=10 -consistency-level=quorum -timeout=90s -partition-offset=1001 -concurrency=100 -connection-count=100 -iterations=0 -duration=1440m -error-at-row-limit 1000 -nodes 10.142.0.134,10.142.0.143,10.142.0.147,10.142.0.152,10.142.0.8'"...

2023_07_29__03_05_15_387.sct-5cd89f45.log:< t:2023-07-29 07:09:25,841 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > stress_cmd=scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=1000 -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -rows-per-request=10 -consistency-level=quorum -timeout=90s -partition-offset=1001 -concurrency=100 -connection-count=100 -iterations=0 -duration=1440m -error-at-row-limit 1000 -nodes 10.142.0.134,10.142.0.143,10.142.0.147,10.142.0.152,10.142.0.8

2023_07_29__19_27_20_757.sct-5cd89f45.log:< t:2023-07-29 19:39:25,215 f:base.py l:146 c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "sudo docker exec e9a633737a5dc44efd1e27efe37cceba012565f98f333d7a4828a9ae42f4b1e7 /bin/sh -c 'scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=1000 -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -rows-per-request=10 -consistency-level=quorum -timeout=90s -partition-offset=1001 -concurrency=100 -connection-count=100 -iterations=0 -duration=1440m -error-at-row-limit 1000 -nodes 10.142.0.134,10.142.0.143,10.142.0.147,10.142.0.152,10.142.0.8'"; Exit status: 137
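For reference, the two numbers in the SCT log lines above can be cross-checked with a small sketch. It assumes the usual shell convention that an exit status above 128 means "killed by signal (status - 128)", and it runs on a POSIX system; the helper name `describe_exit_status` is made up for illustration and is not part of SCT. Status 137 maps to SIGKILL, which is consistent with an OOM kill but does not prove one on its own.

```python
import signal
from datetime import datetime


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper, not SCT code)."""
    if status > 128:
        try:
            # By convention, 128 + N means the process was killed by signal N.
            return f"terminated by {signal.Signals(status - 128).name}"
        except ValueError:
            return f"exited with code {status} (no matching signal)"
    return f"exited with code {status}"


# Timestamps copied from the first and last SCT log lines quoted above.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
started = datetime.strptime("2023-07-29 07:09:25,838", LOG_TIME_FORMAT)
failed = datetime.strptime("2023-07-29 19:39:25,215", LOG_TIME_FORMAT)
elapsed_hours = (failed - started).total_seconds() / 3600

print(describe_exit_status(137))
print(f"stress command ran for {elapsed_hours:.1f} hours")
```

The elapsed time matches the roughly 12.5 hours reported for loader-4.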
The grafana shows:
Kernel Version: 5.15.0-1038-gcp
Scylla version (or git commit hash): 5.4.0~dev-20230728.7351c8424df3 with build-id d47e513b9bff8db5782a5220c05f5ac2a70b7a6c
Cluster size: 5 nodes (n2-highmem-16)
Scylla Nodes used in this run:
OS / Image: `` (gce: undefined_region)
Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: 5cd89f45-07f3-4865-8f4a-34c4071e2bdd
Test name: scylla-master/longevity/longevity-large-partition-200k-pks-4days-gce-test
Test config file(s):
@yarongilor, have we seen such an OOM before?
Yes, we have seen it before: https://github.com/scylladb/scylla-bench/issues/89
Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 5cd89f45-07f3-4865-8f4a-34c4071e2bdd`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=5cd89f45-07f3-4865-8f4a-34c4071e2bdd)
- Show all stored logs command: `$ hydra investigate show-logs 5cd89f45-07f3-4865-8f4a-34c4071e2bdd`

## Logs:

- **db-cluster-5cd89f45.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/db-cluster-5cd89f45.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/db-cluster-5cd89f45.tar.gz)
- **debug-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/debug-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/debug-5cd89f45.log.tar.gz)
- **raw_events-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/raw_events-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/raw_events-5cd89f45.log.tar.gz)
- **summary-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/summary-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/summary-5cd89f45.log.tar.gz)
- **error-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/error-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/error-5cd89f45.log.tar.gz)
- **critical-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/critical-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/critical-5cd89f45.log.tar.gz)
- **output-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/output-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/output-5cd89f45.log.tar.gz)
- **warning-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/warning-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/warning-5cd89f45.log.tar.gz)
- **argus-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/argus-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/argus-5cd89f45.log.tar.gz)
- **normal-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/normal-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/normal-5cd89f45.log.tar.gz)
- **events-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/events-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/events-5cd89f45.log.tar.gz)
- **2023_07_29__03_05_15_387.sct-5cd89f45.log.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__03_05_15_387.sct-5cd89f45.log.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__03_05_15_387.sct-5cd89f45.log.gz)
- **2023_07_29__13_08_54_529.sct-5cd89f45.log.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__13_08_54_529.sct-5cd89f45.log.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__13_08_54_529.sct-5cd89f45.log.gz)
- **2023_07_29__19_27_20_757.sct-5cd89f45.log.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__19_27_20_757.sct-5cd89f45.log.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__19_27_20_757.sct-5cd89f45.log.gz)
- **2023_07_30__02_39_44_655.sct-5cd89f45.log.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_30__02_39_44_655.sct-5cd89f45.log.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_30__02_39_44_655.sct-5cd89f45.log.gz)
- **loader-set-5cd89f45.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/loader-set-5cd89f45.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/loader-set-5cd89f45.tar.gz)
- **monitor-set-5cd89f45.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/monitor-set-5cd89f45.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/monitor-set-5cd89f45.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-large-partition-200k-pks-4days-gce-test/7/)

[Argus](https://argus.scylladb.com/test/917e825f-11f9-4493-acdb-ec5266a3af78/runs?additionalRuns[]=5cd89f45-07f3-4865-8f4a-34c4071e2bdd)