scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

MemoryStress causes connection errors in C-S with encryption #5807

Open fruch opened 1 year ago

fruch commented 1 year ago

Issue description

MemoryStress can cause some c-s connections to fail in the following way:

2023-02-11 20:41:22.112: (CassandraStressLogEvent Severity.ERROR) period_type=one-time event_id=e2fce031-805f-41c8-a7a0-5df755224a44 during_nemesis=MemoryStress: type=ConsistencyError regex=Cannot achieve consistency level line_number=895570 node=Node longevity-tls-50gb-3d-5-2-loader-node-0530dbf7-1 [34.244.60.101 | 10.4.1.90] (seed: False)
ERROR 20:41:21,374 Authentication error while creating additional connection (error is: Authentication error on host ip-10-4-3-183.eu-west-1.compute.internal/10.4.3.183:9042: Cannot achieve consistency level for cl QUORUM. Requires 2, alive 1)

From Scylla's point of view, it looks like this:

2023-02-11T20:41:21+00:00 longevity-tls-50gb-3d-5-2-db-node-0530dbf7-4     !INFO | scylla[34186]:  [shard 10] cql_server - exception while processing connection: std::_Nested_exception<std::system_error> (error GnuTLS:-53, Error in the push function.): std::system_error (error system:32, sendmsg: Broken pipe)

It seems that the SSL implementation is running out of memory and failing, which causes c-s to fail in this way.

Impact

This doesn't impact the main c-s load, so I recommend ignoring these events during the MemoryStress nemesis.
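A minimal sketch of what "ignoring these events" could look like, assuming events are available as simple records with a severity, the active nemesis name, and the raw log line. `StressEvent` and `should_ignore` are hypothetical names used only for illustration, not the SCT API; the pattern is taken from the c-s error quoted above.

```python
import re
from dataclasses import dataclass

# Pattern copied from the c-s log line quoted in this issue.
# It targets only failures of *additional* connections, not the main load.
ADDITIONAL_CONNECTION_ERROR = re.compile(
    r"Authentication error while creating additional connection"
)


@dataclass
class StressEvent:
    """Hypothetical stand-in for a c-s log event (not an SCT class)."""
    severity: str        # e.g. "ERROR"
    during_nemesis: str  # e.g. "MemoryStress", or "" when no nemesis is running
    line: str            # raw c-s log line


def should_ignore(event: StressEvent, nemesis: str = "MemoryStress") -> bool:
    """Return True only for the known connection error raised during `nemesis`."""
    if event.severity != "ERROR":
        return False
    if event.during_nemesis != nemesis:
        return False
    return ADDITIONAL_CONNECTION_ERROR.search(event.line) is not None
```

The filter is deliberately narrow: it matches on the nemesis name and on the specific "additional connection" message, so a `Cannot achieve consistency level` error on the main load, or the same error outside MemoryStress, would still be reported.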

How frequently does it reproduce?

This is the first time we have encountered this type of error in c-s.

Installation details

Kernel Version: 5.15.0-1028-aws

Scylla version (or git commit hash): 5.2.0~rc1-20230207.8ff4717fd010 with build-id 78fbb2c25e9244a62f57988313388a0260084528

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0a6094bea26a69f97 (aws: eu-west-1)

Test: longevity-50gb-3days-test

Test id: 0530dbf7-d9a0-429a-8d7b-47e16846ade2

Test name: scylla-5.2/longevity/longevity-50gb-3days-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 0530dbf7-d9a0-429a-8d7b-47e16846ade2`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=0530dbf7-d9a0-429a-8d7b-47e16846ade2)
- Show all stored logs command: `$ hydra investigate show-logs 0530dbf7-d9a0-429a-8d7b-47e16846ade2`

## Logs:

- **db-cluster-0530dbf7.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/db-cluster-0530dbf7.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/db-cluster-0530dbf7.tar.gz)
- **email_data-0530dbf7.json.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/email_data-0530dbf7.json.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/email_data-0530dbf7.json.tar.gz)
- **output-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/output-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/output-0530dbf7.log.tar.gz)
- **debug-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/debug-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/debug-0530dbf7.log.tar.gz)
- **events-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/events-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/events-0530dbf7.log.tar.gz)
- **sct-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/sct-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/sct-0530dbf7.log.tar.gz)
- **normal-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/normal-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/normal-0530dbf7.log.tar.gz)
- **argus-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/argus-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/argus-0530dbf7.log.tar.gz)
- **raw_events-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/raw_events-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/raw_events-0530dbf7.log.tar.gz)
- **critical-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/critical-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/critical-0530dbf7.log.tar.gz)
- **warning-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/warning-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/warning-0530dbf7.log.tar.gz)
- **summary-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/summary-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/summary-0530dbf7.log.tar.gz)
- **left_processes-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/left_processes-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/left_processes-0530dbf7.log.tar.gz)
- **error-0530dbf7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/error-0530dbf7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/error-0530dbf7.log.tar.gz)
- **monitor-set-0530dbf7.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/monitor-set-0530dbf7.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/monitor-set-0530dbf7.tar.gz)
- **loader-set-0530dbf7.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/loader-set-0530dbf7.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/0530dbf7-d9a0-429a-8d7b-47e16846ade2/20230212_201319/loader-set-0530dbf7.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-5.2/job/longevity/job/longevity-50gb-3days-test/4/)

mykaul commented 9 months ago

I think we've removed this Nemesis and can close this one?

fruch commented 9 months ago

> I think we've removed this Nemesis and can close this one?

If you had waited a couple more weeks, it would have closed on its own :)

But we wanted to bring it back once we have a way to limit it to Scylla processes only.