scylladb / scylla-bench

Loader got OOM running a scylla-bench read stress #127

Closed yarongilor closed 5 days ago

yarongilor commented 1 year ago

Issue description

loader-4 ran the following scylla-bench (s-b) read stress for about 12.5 hours of the expected 24 hours (07:09 to 19:39 on 2023-07-29) before the command failed with exit status 137.

Its log file is: scylla-bench-l0-bdfd47ae-45c4-4348-a835-ba48fe0397b9.log

SCT log shows:

2023_07_29__03_05_15_387.sct-5cd89f45.log:< t:2023-07-29 07:09:25,838 f:scylla_bench_thread.py l:195  c:sdcm.scylla_bench_thread p:DEBUG > Scylla bench command: scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=1000 -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -rows-per-request=10 -consistency-level=quorum -timeout=90s -partition-offset=1001 -concurrency=100 -connection-count=100 -iterations=0 -duration=1440m  -error-at-row-limit 1000
2023_07_29__03_05_15_387.sct-5cd89f45.log:< t:2023-07-29 07:09:25,840 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo  docker exec e9a633737a5dc44efd1e27efe37cceba012565f98f333d7a4828a9ae42f4b1e7 /bin/sh -c 'scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=1000 -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -rows-per-request=10 -consistency-level=quorum -timeout=90s -partition-offset=1001 -concurrency=100 -connection-count=100 -iterations=0 -duration=1440m  -error-at-row-limit 1000 -nodes 10.142.0.134,10.142.0.143,10.142.0.147,10.142.0.152,10.142.0.8'"...
2023_07_29__03_05_15_387.sct-5cd89f45.log:< t:2023-07-29 07:09:25,841 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=1000 -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -rows-per-request=10 -consistency-level=quorum -timeout=90s -partition-offset=1001 -concurrency=100 -connection-count=100 -iterations=0 -duration=1440m  -error-at-row-limit 1000 -nodes 10.142.0.134,10.142.0.143,10.142.0.147,10.142.0.152,10.142.0.8
2023_07_29__19_27_20_757.sct-5cd89f45.log:< t:2023-07-29 19:39:25,215 f:base.py         l:146  c:RemoteLibSSH2CmdRunner p:ERROR > Error executing command: "sudo  docker exec e9a633737a5dc44efd1e27efe37cceba012565f98f333d7a4828a9ae42f4b1e7 /bin/sh -c 'scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=1000 -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -rows-per-request=10 -consistency-level=quorum -timeout=90s -partition-offset=1001 -concurrency=100 -connection-count=100 -iterations=0 -duration=1440m  -error-at-row-limit 1000 -nodes 10.142.0.134,10.142.0.143,10.142.0.147,10.142.0.152,10.142.0.8'"; Exit status: 137
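
Exit status 137 in the last log line follows the common shell convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL). That is consistent with the kernel OOM killer terminating the process, though it does not prove it. A minimal sketch, assuming a POSIX host; `describe_exit_status` is a hypothetical helper name, and the two timestamps are copied from the "Running command" and "Error executing command" lines above:

```python
import signal
from datetime import datetime


def describe_exit_status(status: int) -> str:
    """Decode a shell-style exit status (values above 128 mean 'killed by signal status - 128')."""
    if status > 128:
        sig = signal.Signals(status - 128)  # assumption: POSIX signal numbering
        return f"killed by {sig.name}"
    return f"exited with code {status}"


# Timestamps copied from the SCT log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S,%f"
started = datetime.strptime("2023-07-29 07:09:25,840", FMT)
failed = datetime.strptime("2023-07-29 19:39:25,215", FMT)
elapsed_hours = (failed - started).total_seconds() / 3600

print(describe_exit_status(137))
print(f"ran for {elapsed_hours:.2f} hours")
```

The elapsed time comes out at roughly 12.5 hours, matching the runtime reported above.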

Grafana shows: (image: Screenshot from 2023-08-02 19-51-34)
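
For a rough sense of scale, the flags in the logged stress command can be turned into back-of-the-envelope numbers. This is only a sketch: it assumes `uniform:10..10240` means row sizes are spread evenly over that range (so the mean is the midpoint), counts row payload only, and ignores replication, driver buffers, and any scylla-bench bookkeeping. The variable names are illustrative.

```python
# Flag values copied from the logged scylla-bench command.
partition_count = 1000
clustering_row_count = 200_000
row_size_min, row_size_max = 10, 10_240
rows_per_request = 10
concurrency = 100

# Assumption: mean row size is the midpoint of the uniform range.
mean_row_size = (row_size_min + row_size_max) / 2

# Rows and payload bytes in the key range this command reads.
total_rows = partition_count * clustering_row_count
range_bytes = total_rows * mean_row_size

# Payload held by requests in flight if every worker has one request outstanding.
in_flight_bytes = concurrency * rows_per_request * mean_row_size

print(f"rows in range:     {total_rows:,}")
print(f"payload in range:  {range_bytes / 1e12:.3f} TB")
print(f"in-flight payload: {in_flight_bytes / 1e6:.3f} MB")
```

Under these assumptions the payload in flight at any moment is on the order of megabytes, while the range being scanned is on the order of a terabyte. Which of the two the loader's memory usage tracks is not something this estimate can answer.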

Installation details

Kernel Version: 5.15.0-1038-gcp

Scylla version (or git commit hash): 5.4.0~dev-20230728.7351c8424df3 with build-id d47e513b9bff8db5782a5220c05f5ac2a70b7a6c

Cluster size: 5 nodes (n2-highmem-16)

Scylla Nodes used in this run:

OS / Image: `` (gce: undefined_region)

Test: longevity-large-partition-200k-pks-4days-gce-test

Test id: 5cd89f45-07f3-4865-8f4a-34c4071e2bdd

Test name: scylla-master/longevity/longevity-large-partition-200k-pks-4days-gce-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 5cd89f45-07f3-4865-8f4a-34c4071e2bdd`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=5cd89f45-07f3-4865-8f4a-34c4071e2bdd)
- Show all stored logs command: `$ hydra investigate show-logs 5cd89f45-07f3-4865-8f4a-34c4071e2bdd`

## Logs:

- **db-cluster-5cd89f45.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/db-cluster-5cd89f45.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/db-cluster-5cd89f45.tar.gz)
- **debug-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/debug-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/debug-5cd89f45.log.tar.gz)
- **raw_events-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/raw_events-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/raw_events-5cd89f45.log.tar.gz)
- **summary-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/summary-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/summary-5cd89f45.log.tar.gz)
- **error-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/error-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/error-5cd89f45.log.tar.gz)
- **critical-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/critical-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/critical-5cd89f45.log.tar.gz)
- **output-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/output-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/output-5cd89f45.log.tar.gz)
- **warning-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/warning-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/warning-5cd89f45.log.tar.gz)
- **argus-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/argus-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/argus-5cd89f45.log.tar.gz)
- **normal-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/normal-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/normal-5cd89f45.log.tar.gz)
- **events-5cd89f45.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/events-5cd89f45.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/events-5cd89f45.log.tar.gz)
- **2023_07_29__03_05_15_387.sct-5cd89f45.log.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__03_05_15_387.sct-5cd89f45.log.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__03_05_15_387.sct-5cd89f45.log.gz)
- **2023_07_29__13_08_54_529.sct-5cd89f45.log.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__13_08_54_529.sct-5cd89f45.log.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__13_08_54_529.sct-5cd89f45.log.gz)
- **2023_07_29__19_27_20_757.sct-5cd89f45.log.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__19_27_20_757.sct-5cd89f45.log.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_29__19_27_20_757.sct-5cd89f45.log.gz)
- **2023_07_30__02_39_44_655.sct-5cd89f45.log.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_30__02_39_44_655.sct-5cd89f45.log.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/2023_07_30__02_39_44_655.sct-5cd89f45.log.gz)
- **loader-set-5cd89f45.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/loader-set-5cd89f45.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/loader-set-5cd89f45.tar.gz)
- **monitor-set-5cd89f45.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/monitor-set-5cd89f45.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5cd89f45-07f3-4865-8f4a-34c4071e2bdd/20230730_091441/monitor-set-5cd89f45.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-large-partition-200k-pks-4days-gce-test/7/)

[Argus](https://argus.scylladb.com/test/917e825f-11f9-4493-acdb-ec5266a3af78/runs?additionalRuns[]=5cd89f45-07f3-4865-8f4a-34c4071e2bdd)

fgelcer commented 1 year ago

@yarongilor, have we seen such an OOM before?

yarongilor commented 1 year ago

Yes, we saw it in https://github.com/scylladb/scylla-bench/issues/89

fruch commented 5 days ago

Since the loader size was increased and the test was refactored by @roydahan, we haven't seen this issue anymore.

Closing for now.