Stress command completed with bad status 2: fatal error: runtime: out of memory

fgelcer commented 2 years ago

Installation details

Kernel version: 5.11.0-1027-aws
Scylla version (or git commit hash): 5.0.dev-0.20220127.ba6c02b38 with build-id b93317e46cc252428454f96e8716b0948f28304c
Cluster size: 4 nodes (i3en.2xlarge)
Scylla running with shards number (live nodes):
longevity-twcs-48h-master-db-node-cbff75ad-1 (52.213.186.186 | 10.0.1.49): 8 shards
longevity-twcs-48h-master-db-node-cbff75ad-2 (34.240.8.80 | 10.0.1.57): 8 shards
longevity-twcs-48h-master-db-node-cbff75ad-3 (63.32.43.36 | 10.0.3.184): 8 shards
longevity-twcs-48h-master-db-node-cbff75ad-4 (3.250.222.82 | 10.0.0.62): 8 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-098e8a18da4ea000f (aws: eu-west-1)

Test: longevity-twcs-48h-test
Test name: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):

longevity-twcs-48h.yaml

Issue description

2022-01-27 16:47:11.721: (ScyllaBenchEvent Severity.NORMAL) period_type=begin event_id=1e47c93c-159b-4b13-8b7e-923ffa59f1d2: node=Node longevity-twcs-48h-master-loader-node-cbff75ad-1 [18.203.69.245 | 10.0.2.209] (seed: False)
stress_cmd=scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000000 -clustering-row-size=200 -concurrency=100 -rows-per-request=100 -start-timestamp=SET_WRITE_TIMESTAMP -connection-count 100 -max-rate 50000 --timeout 120s -duration=2880m -error-at-row-limit 1000
2022-01-27 16:48:11.743: (ScyllaBenchEvent Severity.NORMAL) period_type=begin event_id=ae5dc835-8ef1-4361-a160-e4c58b0da3e1: node=Node longevity-twcs-48h-master-loader-node-cbff75ad-2 [34.247.166.94 | 10.0.0.150] (seed: False)
stress_cmd=scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000000 -clustering-row-size=200 -concurrency=100 -rows-per-request=100 -start-timestamp=SET_WRITE_TIMESTAMP -connection-count 100 -max-rate 50000 --timeout 120s -duration=2880m -error-at-row-limit 1000
2022-01-27 16:48:49.009: (ScyllaBenchEvent Severity.CRITICAL) period_type=end event_id=1e47c93c-159b-4b13-8b7e-923ffa59f1d2 duration=1m37s: node=Node longevity-twcs-48h-master-loader-node-cbff75ad-1 [18.203.69.245 | 10.0.2.209] (seed: False)
stress_cmd=scylla-bench -workload=timeseries -mode=write -replication-factor=3 -partition-count=400 -clustering-row-count=10000000 -clustering-row-size=200 -concurrency=100 -rows-per-request=100 -start-timestamp=SET_WRITE_TIMESTAMP -connection-count 100 -max-rate 50000 --timeout 120s -duration=2880m -error-at-row-limit 1000
errors:

Stress command completed with bad status 2: fatal error: runtime: out of memory

runtime stack:
runtime.throw({0x6f6a40, 0x1800000})
        /usr/local
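
The event lines above are enough to check how soon after start the first loader died. A minimal sketch, assuming the SCT event line layout seen in this report; the regular expression and the `event_durations` helper are illustrative only and are not part of SCT or scylla-bench:

```python
import re
from datetime import datetime

# Event lines copied from the report above (trailing node details trimmed).
EVENTS = [
    "2022-01-27 16:47:11.721: (ScyllaBenchEvent Severity.NORMAL) "
    "period_type=begin event_id=1e47c93c-159b-4b13-8b7e-923ffa59f1d2",
    "2022-01-27 16:48:11.743: (ScyllaBenchEvent Severity.NORMAL) "
    "period_type=begin event_id=ae5dc835-8ef1-4361-a160-e4c58b0da3e1",
    "2022-01-27 16:48:49.009: (ScyllaBenchEvent Severity.CRITICAL) "
    "period_type=end event_id=1e47c93c-159b-4b13-8b7e-923ffa59f1d2 duration=1m37s",
]

# Assumed layout: "<timestamp>: (<EventClass> Severity.<LEVEL>) period_type=<p> event_id=<uuid>"
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"\(\w+ Severity\.(?P<severity>\w+)\) "
    r"period_type=(?P<period>\w+) event_id=(?P<event_id>[0-9a-f-]+)"
)


def event_durations(lines):
    """Return {event_id: seconds between its 'begin' and 'end' lines}."""
    begins = {}
    durations = {}
    for line in lines:
        match = EVENT_RE.match(line)
        if match is None:
            continue  # not an event header line (e.g. a stress_cmd line)
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        event_id = match["event_id"]
        if match["period"] == "begin":
            begins[event_id] = ts
        elif match["period"] == "end" and event_id in begins:
            durations[event_id] = (ts - begins[event_id]).total_seconds()
    return durations


if __name__ == "__main__":
    for event_id, seconds in event_durations(EVENTS).items():
        print(f"{event_id}: {seconds:.3f}s")
```

For the lines in this report, only event 1e47c93c-159b-4b13-8b7e-923ffa59f1d2 has both a begin and an end line, and the computed gap of roughly 97 seconds agrees with the logged `duration=1m37s`. The second loader's event has no end line in this excerpt, so it is not reported.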

Restore Monitor Stack command: $ hydra investigate show-monitor cbff75ad-db6e-458d-a51a-f7c51d4bbf8a
Restore monitor on AWS instance using Jenkins job
Show all stored logs command: $ hydra investigate show-logs cbff75ad-db6e-458d-a51a-f7c51d4bbf8a

Test id: cbff75ad-db6e-458d-a51a-f7c51d4bbf8a

Logs:
grafana - https://cloudius-jenkins-test.s3.amazonaws.com/cbff75ad-db6e-458d-a51a-f7c51d4bbf8a/20220127_164957/grafana-screenshot-longevity-twcs-48h-test-scylla-per-server-metrics-nemesis-20220127_164957-longevity-twcs-48h-master-monitor-node-cbff75ad-1.png
db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/cbff75ad-db6e-458d-a51a-f7c51d4bbf8a/20220127_165605/db-cluster-cbff75ad.tar.gz
loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/cbff75ad-db6e-458d-a51a-f7c51d4bbf8a/20220127_165605/loader-set-cbff75ad.tar.gz
monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/cbff75ad-db6e-458d-a51a-f7c51d4bbf8a/20220127_165605/monitor-set-cbff75ad.tar.gz
sct - https://cloudius-jenkins-test.s3.amazonaws.com/cbff75ad-db6e-458d-a51a-f7c51d4bbf8a/20220127_165605/sct-runner-cbff75ad.tar.gz

Jenkins job URL

roydahan commented 2 years ago

@dkropachev any idea?

dkropachev commented 2 years ago

This is related to hdr memory consumption in scylla-bench (s-b) and is fixed in v0.1.8. Related SCT PR: https://github.com/scylladb/scylla-cluster-tests/pull/4384

dkropachev commented 2 years ago

Should be fixed by https://github.com/scylladb/scylla-bench/pull/87

scylladb / scylla-bench

Stress command completed with bad status 2: fatal error: runtime: out of memory #88