scylladb / scylla-cluster-tests

Tests for Scylla Clusters
GNU Affero General Public License v3.0

[longevity-tls-50gb-3d]: test timeout event that stress finishes in time #5345

Open fruch opened 1 year ago

fruch commented 1 year ago

Installation details

Kernel Version: 5.15.0-1020-aws Scylla version (or git commit hash): 5.2.0~dev-20221002.060dda8e00b7 with build-id b4e08d869feb2fecb04c6ea45eb8946cade10c70

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-014a8e66eac5a37c5 (aws: us-east-1)

Test: longevity-50gb-3days Test id: ee8833a4-d087-47f6-9865-3db37c29f94b Test name: scylla-master/longevity/longevity-50gb-3days Test config file(s):

Issue description

Test times out at 15:13, while waiting for stress threads to finish:

< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > 2022-10-10 15:13:36.217: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=11a70335-2bb5-44aa-973f-26a68ce1c51e, source=LongevityTest.test_custom_time (longevity_test.LongevityTest)() message=Traceback (most recent call last):
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/longevity_test.py", line 140, in test_custom_time
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     self.verify_stress_thread(cs_thread_pool=stress)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1906, in verify_stress_thread
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     results, errors = cs_thread_pool.verify_results()
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 281, in verify_results
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     for future in concurrent.futures.as_completed(self.results_futures, timeout=self.timeout):
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 245, in as_completed
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     waiter.event.wait(wait_timeout)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/usr/local/lib/python3.10/threading.py", line 600, in wait
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     signaled = self._cond.wait(timeout)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/usr/local/lib/python3.10/threading.py", line 324, in wait
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     gotit = waiter.acquire(True, timeout)
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 241, in critical_failure_handler
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR >     raise CriticalTestFailure("Critical Error has failed the test")  # pylint: disable=raise-missing-from
< t:2022-10-10 15:13:36,220 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > sdcm.tester.CriticalTestFailure: Critical Error has failed the test
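
For reference, here is a minimal standalone sketch (assumed names, not SCT's actual stress code) of the mechanism the traceback hits: `concurrent.futures.as_completed` raises `TimeoutError` as soon as its timeout expires, even if all but one of the stress futures have already finished.

```python
# Minimal standalone sketch (not the actual SCT code) of the timeout mechanism in
# the traceback above: as_completed() raises concurrent.futures.TimeoutError once
# the timeout elapses, even if all but one of the stress futures already finished.
import concurrent.futures
import time

def fake_stress_thread(duration):
    time.sleep(duration)
    return f"c-s finished after {duration}s"

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = [
        executor.submit(fake_stress_thread, 1),   # finishes well within the timeout
        executor.submit(fake_stress_thread, 10),  # stands in for a hung c-s thread
    ]
    try:
        for future in concurrent.futures.as_completed(futures, timeout=3):
            print(future.result())
    except concurrent.futures.TimeoutError as exc:
        # e.g. "1 (of 2) futures unfinished" -- this is what fails the test
        print(f"timed out waiting for stress threads: {exc}")

# Note: leaving the "with" block still waits for the hung worker thread to finish,
# much like the leftover c-s process keeps running on the loader.
```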

While most of the main stress threads have finished, one still fails on the timeout:

43826232:< t:2022-10-10 15:03:53,951 f:decorators.py   l:69   c:sdcm.utils.decorators p:DEBUG > '_run': failed with 'RetryableNetworkException('Command did not complete within 259800 seconds!\n\nCommand: \'echo TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1; STRESS_TEST_MARKER=JG6YDU4LBWMI2FJ2CBP7; cassandra-stress read cl=QUORUM duration=4320m -mode cql3 native   user=cassandra password=cassandra -rate threads=50 -pop seq=1..100000000 -log interval=5 -transport "truststore=/etc/scylla/ssl_conf/client/cacerts.jks truststore-password=cassandra" -node 10.12.9.39,10.12.11.81,10.12.8.12,10.12.9.74,10.12.11.178,10.12.9.114 -errors skip-unsupported-columns\
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-10-10 15:05:12.385: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=f5e9fa3c-51a6-43a9-8a9b-e485c37815ab during_nemesis=NoCorruptRepair duration=3d10h50m53s: node=Node longevity-tls-50gb-3d-master-loader-node-ee8833a4-2 [18.234.34.198 | 10.12.9.30] (seed: False)
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress read cl=QUORUM duration=4320m -mode cql3 native  -rate threads=50 -pop seq=1..100000000 -log interval=5
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > errors:
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >
< t:2022-10-10 15:05:12,388 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Stress command completed with bad status 1: WARN  15:04:04,928 Error creating netty channel to /10.12.9.114:9042

Looking at the log, it seems like it keeps trying to contact a non-existent node, and that slows it down a bit.

Logs:

Jenkins job URL

KnifeyMoloko commented 1 year ago

Reproduced in 5.0.5 as well, with a shorter longevity test

Installation details

Kernel Version: 5.15.0-1018-gcp Scylla version (or git commit hash): 5.0.5-20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee

Cluster size: 6 nodes (n1-highmem-16)

Scylla Nodes used in this run:

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/5935604957551590924 (gce: us-east1)

Test: longevity-10gb-3h-gce-test Test id: fb62010e-329a-4d63-aeac-50e1c1e13341 Test name: scylla-5.0/longevity/longevity-10gb-3h-gce-test Test config file(s):

Issue description

The test timed out because the c-s threads did not finish before the timeout was reached. Looking at the loader logs, however, I see that the 2 longest c-s runs did finish:

Total GC memory           : 0.000 KiB
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
Total GC time             :    0.0 seconds
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
Avg GC time               :    NaN ms
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
StdDev GC time            :    0.0 ms
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
Total operation time      : 03:00:00
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed

com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
END

Yet it still attempted to connect to a node that was (successfully) decommissioned during the test:

com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed


Logs:

Jenkins job URL

fgelcer commented 1 year ago

IIRC we had a similar issue, and the problem was something with the driver or the log writing taking too long, so the stress, although finished, was stuck writing to logs or something like that... Could that be the case here, except that now, because it is running inside a Docker container, we are having problems writing logs within the container?

fruch commented 1 year ago

IIRC we had a similar issue, and the problem was something with the driver or the log writing taking too long, so the stress, although finished, was stuck writing to logs or something like that... Could that be the case here, except that now, because it is running inside a Docker container, we are having problems writing logs within the container?

We are not using Docker for cassandra-stress, at least not yet.

soyacz commented 1 year ago

I think this error might be related to two of my other changes:

https://github.com/scylladb/scylla-cluster-tests/pull/5074 - where we time out the stress thread based on its duration. We add a 10-minute margin to the stress command timeout, and it might be too little (especially when the no-warmup flag is not set). But this could also be a result of the c-s hang we notice sometimes (that's why we added this fix).

The other thing is that we didn't see this error before because timeouts were not caught/working properly - this was fixed in https://github.com/scylladb/scylla-cluster-tests/pull/5311

I propose prolonging this 10m period by 5% of the duration (so for short c-s runs we don't wait too long), as sketched below. We could also add the no-warmup flag everywhere - but I don't know if that would be the right approach.
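
A minimal sketch of that proposal (the helper name is hypothetical, and it assumes the margin becomes the current 10 minutes plus 5% of the stress duration):

```python
# Hypothetical helper (illustrative name, not the actual SCT code) showing the
# proposed margin: keep the current 10-minute floor and extend it by 5% of the
# stress duration, so long runs get more slack while short c-s runs finish quickly.
def stress_thread_timeout(duration_sec: int) -> int:
    margin_sec = 10 * 60 + int(duration_sec * 0.05)
    return duration_sec + margin_sec

# duration=4320m (the 3-day run above): 259200 + 600 + 12960 = 272760 seconds,
# versus the current 259800 seconds seen in the "did not complete within" error.
print(stress_thread_timeout(4320 * 60))
# a 1-hour c-s run would only gain an extra 180 seconds over today's margin
print(stress_thread_timeout(60 * 60))
```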

soyacz commented 1 year ago

I verified the stress logs, and for the test in the first issue comment, in cassandra-stress-l1-c0-k1-30fb05d9-67fe-4988-8df3-997b0aa6cf22.log from loader-node-2 I can see it froze (in this log we can also see starting lines after 11h, which I don't understand). In other tests I also found c-s logs going silent after some time, leading to the timeout error.

So I don't think increasing the timeout will help.

We need to find out why these c-s threads freeze.

e.g. see cassandra-stress-l0-c0-k1-07f02534-d226-4550-976a-2f795c9801bc.log from a test on master:

Installation details

Kernel Version: 5.15.0-1021-azure Scylla version (or git commit hash): 5.2.0~dev-20221011.fcf0628bc5d0 with build-id 8f3895a59a9d0e6bcc3c0f8932dee690e20feff9

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.2.0-dev-x86_64-2022-10-11T04-25-04Z (azure: eastus)

Test: longevity-10gb-3h-azure-test Test id: 2eb4c2ab-04b5-4d1f-9b7d-29f50ebff825 Test name: scylla-master/longevity/longevity-10gb-3h-azure-test Test config file(s):

Issue description

Logs:

Jenkins job URL

ilya-rarov commented 1 year ago

Got something similar in a 2022.2 job (although the test didn't time out)

Installation details

Kernel Version: 5.15.0-1022-aws Scylla version (or git commit hash): 2022.2.0~rc4-20221106.f5714e0db12f with build-id f4a927b8a00fbcd8d48640835192aeaa7968b1f2 Relocatable Package: http://downloads.scylladb.com/unstable/scylla-enterprise/enterprise-2022.2/relocatable/2022-11-06T15:44:05Z/scylla-enterprise-x86_64-package.tar.gz

Cluster size: 6 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-06afe72a2cd370dbb (aws: eu-west-2)

Test: longevity-50gb-3days-aws-i4i-test Test id: a94220e9-cdb8-4806-9590-185b346bbad1 Test name: enterprise-2022.2/longevity/longevity-50gb-3days-aws-i4i-test Test config file(s):

Issue description

In the beginning two stress threads were started among many others:

Stress command completed with bad status 1: WARN 08:57:16,007 Error creating netty channel to /10.3.0.176:9042 com.datastax.shaded.netty.channe

2022-11-13 08:57:38.326: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=6e3c6801-ef57-4c16-b225-6feef198cdab duration=3d0h12m9s: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-3 [18.132.47.93 | 10.3.1.228] (seed: False) stress_cmd=cassandra-stress mixed cl=QUORUM duration=4320m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=100 -pop seq=1..100000000 -log interval=5 errors:

Stress command completed with bad status 1: WARN 08:57:23,897 Error creating netty channel to /10.3.0.170:9042 com.datastax.shaded.netty.channe



- Restore Monitor Stack command: `$ hydra investigate show-monitor a94220e9-cdb8-4806-9590-185b346bbad1`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=a94220e9-cdb8-4806-9590-185b346bbad1)
- Show all stored logs command: `$ hydra investigate show-logs a94220e9-cdb8-4806-9590-185b346bbad1`

## Logs:
- **db-cluster-a94220e9.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/db-cluster-a94220e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/db-cluster-a94220e9.tar.gz)
- **monitor-set-a94220e9.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/monitor-set-a94220e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/monitor-set-a94220e9.tar.gz)
- **loader-set-a94220e9.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/loader-set-a94220e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/loader-set-a94220e9.tar.gz)
- **normal-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/normal-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/normal-a94220e9.log.tar.gz)
- **summary-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/summary-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/summary-a94220e9.log.tar.gz)
- **events-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/events-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/events-a94220e9.log.tar.gz)
- **output-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/output-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/output-a94220e9.log.tar.gz)
- **debug-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/debug-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/debug-a94220e9.log.tar.gz)
- **sct-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/sct-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/sct-a94220e9.log.tar.gz)
- **error-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/error-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/error-a94220e9.log.tar.gz)
- **critical-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/critical-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/critical-a94220e9.log.tar.gz)
- **raw_events-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/raw_events-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/raw_events-a94220e9.log.tar.gz)
- **warning-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/warning-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/warning-a94220e9.log.tar.gz)
- **email_data-a94220e9.json.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/email_data-a94220e9.json.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/email_data-a94220e9.json.tar.gz)
- **argus-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/argus-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/argus-a94220e9.log.tar.gz)
- **left_processes-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/left_processes-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/left_processes-a94220e9.log.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2022.2/job/longevity/job/longevity-50gb-3days-aws-i4i-test/11/)

fgelcer commented 1 year ago

We are most likely seeing something similar in a rolling upgrade test:

Installation details

Kernel Version: 3.10.0-1160.76.1.el7.x86_64 Scylla version (or git commit hash): 5.0.5-20221009.5a97a1060 with build-id 5009658b834aaf68970135bfc84f964b66ea4dee Relocatable Package: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.0/scylla-x86_64-package-5.0.5.0.20221009.5a97a1060.tar.gz

Cluster size: 4 nodes (n1-highmem-8)

Scylla Nodes used in this run:

OS / Image: https://www.googleapis.com/compute/v1/projects/centos-cloud/global/images/family/centos-7 (gce: us-east1)

Test: rolling-upgrade-centos7-test Test id: edcca94b-e7eb-4751-ba67-513d9f6cc34a Test name: scylla-5.1/rolling-upgrade/rolling-upgrade-centos7-test Test config file(s):

Issue description

The test timed out during step 5 (out of 10), because these steps took some 30 minutes longer than they usually do (we run c-s and gemini stress multiple times in this test). There are no errors, nor stress command failures, but things seem to be delayed.

Logs:

Jenkins job URL

aleksbykov commented 8 months ago

Same issue reproduced on Azure:

Issue description

Some cassandra-stress threads continue trying to connect to a removed node after c-s has reported its results, which causes this error:

2023-11-12 20:18:50.601: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=3b624bf0-6e0b-4a6a-a174-131a8d2aabc7, source=LongevityTest.test_custom_time (longevity_test.LongevityTest)() message=Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/longevity_test.py", line 166, in test_custom_time
    self.verify_stress_thread(cs_thread_pool=stress)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 2165, in verify_stress_thread
    results, errors = cs_thread_pool.verify_results()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 379, in verify_results
    results = super().get_results()
  File "/home/ubuntu/scylla-cluster-tests/sdcm/stress/base.py", line 88, in get_results
    for future in concurrent.futures.as_completed(self.results_futures, timeout=timeout):
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 241, in as_completed
    raise TimeoutError(
concurrent.futures._base.TimeoutError: 2 (of 2) futures unfinished

Impact

The test is marked as failed with an error or a critical error.

How frequently does it reproduce?

The latest 3 runs on Azure.

Installation details

Kernel Version: 5.15.0-1051-azure Scylla version (or git commit hash): 5.5.0~dev-20231112.8d618bbfc657 with build-id d2846e72c5ad6493e5eb93f1eb9227eca2761f92

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.5.0-dev-x86_64-2023-11-12T14-55-14 (azure: undefined_region)

Test: longevity-10gb-3h-azure-test Test id: 5be183a7-473d-45d5-a1c5-fb8308c56f5c Test name: scylla-master/longevity/longevity-10gb-3h-azure-test Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 5be183a7-473d-45d5-a1c5-fb8308c56f5c`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=5be183a7-473d-45d5-a1c5-fb8308c56f5c)
- Show all stored logs command: `$ hydra investigate show-logs 5be183a7-473d-45d5-a1c5-fb8308c56f5c`

## Logs:
- **db-cluster-5be183a7.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/db-cluster-5be183a7.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/db-cluster-5be183a7.tar.gz)
- **sct-runner-events-5be183a7.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/sct-runner-events-5be183a7.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/sct-runner-events-5be183a7.tar.gz)
- **sct-5be183a7.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/sct-5be183a7.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/sct-5be183a7.log.tar.gz)
- **loader-set-5be183a7.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/loader-set-5be183a7.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/loader-set-5be183a7.tar.gz)
- **monitor-set-5be183a7.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/monitor-set-5be183a7.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/monitor-set-5be183a7.tar.gz)
- **parallel-timelines-report-5be183a7.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/parallel-timelines-report-5be183a7.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5be183a7-473d-45d5-a1c5-fb8308c56f5c/20231112_202404/parallel-timelines-report-5be183a7.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-10gb-3h-azure-test/251/)
[Argus](https://argus.scylladb.com/test/aa05fe67-89bd-4e30-88db-5e2b2e3d986d/runs?additionalRuns[]=5be183a7-473d-45d5-a1c5-fb8308c56f5c)

fruch commented 8 months ago

@aleksbykov

What are the loader logs showing?

I don't think it's related to this year-old issue.

enaydanov commented 8 months ago

@fruch @aleksbykov

I can agree with Israel. This error looks related to https://github.com/scylladb/java-driver/issues/258:

The c-s failed to stop after the DecommissionStreamingErr nemesis.