fruch opened 1 year ago
Reproduced in 5.0.5 as well, with a shorter longevity run
Kernel Version: 5.15.0-1018-gcp
Scylla version (or git commit hash): 5.0.5-20221009.5a97a1060
with build-id 5009658b834aaf68970135bfc84f964b66ea4dee
Cluster size: 6 nodes (n1-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/5935604957551590924
(gce: us-east1)
Test: longevity-10gb-3h-gce-test
Test id: fb62010e-329a-4d63-aeac-50e1c1e13341
Test name: scylla-5.0/longevity/longevity-10gb-3h-gce-test
Test config file(s):
>>>>>>> The test timed out before the c-s threads finished. Looking in the loader logs, however, I see that the two longest c-s runs did finish:
Total GC memory : 0.000 KiB
Total GC time : 0.0 seconds
Avg GC time : NaN ms
StdDev GC time : 0.0 ms
Total operation time : 03:00:00
END
Yet it still attempted to connect to a node that was (successfully) decommissioned during the test; the following error repeats throughout the log:
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-0-db-node-fb62010e-0-2.c.skilled-adapter-452.internal/10.142.0.135:9042] Connection has been closed
<<<<<<<
$ hydra investigate show-monitor fb62010e-329a-4d63-aeac-50e1c1e13341
$ hydra investigate show-logs fb62010e-329a-4d63-aeac-50e1c1e13341
IIRC we had a similar issue where the problem was the driver, or log writing, taking too long: the stress command had actually finished but was stuck writing to its logs, or something like that. Could that be the case here, but now, because it is running inside a docker container, we are having problems writing logs within the container?
We are not using docker for cassandra-stress, at least not yet
I think this error might be related to my two other changes:
https://github.com/scylladb/scylla-cluster-tests/pull/5074 - where we time out a stress thread based on its duration. We add a 10-minute margin to the stress command, which might be too little (especially when the no-warmup flag is not set). But this could also be the result of the c-s hang we notice sometimes (which is why we added that fix).
The other thing is that we didn't see this error before because the timeouts were not being caught/working properly - this was fixed in https://github.com/scylladb/scylla-cluster-tests/pull/5311
I propose to extend this 10m margin by 5% of the duration (so for short c-s runs we don't wait too long). We could also add the no-warmup flag everywhere - but I don't know if that would be the right approach.
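As a rough sketch of the proposal (a hypothetical helper, not the actual SCT code), the timeout would combine the fixed 10-minute margin with 5% of the stress duration:

```python
def stress_timeout(duration_minutes: float) -> float:
    """Return a timeout (in minutes) for a stress command of the given duration.

    Hypothetical sketch of the proposed scheme: a fixed 10-minute margin plus
    5% of the duration, so short runs keep a small margin while long runs get
    proportionally more headroom (e.g. for warmup when no-warmup is not set).
    """
    return duration_minutes + 10 + 0.05 * duration_minutes


# A 3h (180 min) c-s run would time out after 199 minutes,
# and a 3-day (4320 min) run after 4546 minutes.
print(stress_timeout(180))   # 199.0
print(stress_timeout(4320))  # 4546.0
```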
I verified the stress logs, and in the test from the first issue comment, in cassandra-stress-l1-c0-k1-30fb05d9-67fe-4988-8df3-997b0aa6cf22.log
from loader-node-2, I can see it froze (we can also see starting lines appearing in this log after 11h, which I don't understand). In another test I also found c-s logs going silent after some time, leading to the timeout error.
So I don't think increasing the timeout will help.
We need to find out why these c-s threads freeze.
e.g. see cassandra-stress-l0-c0-k1-07f02534-d226-4550-976a-2f795c9801bc.log
from test on master:
Kernel Version: 5.15.0-1021-azure
Scylla version (or git commit hash): 5.2.0~dev-20221011.fcf0628bc5d0
with build-id 8f3895a59a9d0e6bcc3c0f8932dee690e20feff9
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.2.0-dev-x86_64-2022-10-11T04-25-04Z
(azure: eastus)
Test: longevity-10gb-3h-azure-test
Test id: 2eb4c2ab-04b5-4d1f-9b7d-29f50ebff825
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):
$ hydra investigate show-monitor 2eb4c2ab-04b5-4d1f-9b7d-29f50ebff825
$ hydra investigate show-logs 2eb4c2ab-04b5-4d1f-9b7d-29f50ebff825
Got something similar in a 2022.2 job (although the test didn't time out)
Kernel Version: 5.15.0-1022-aws
Scylla version (or git commit hash): 2022.2.0~rc4-20221106.f5714e0db12f
with build-id f4a927b8a00fbcd8d48640835192aeaa7968b1f2
Relocatable Package: http://downloads.scylladb.com/unstable/scylla-enterprise/enterprise-2022.2/relocatable/2022-11-06T15:44:05Z/scylla-enterprise-x86_64-package.tar.gz
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-06afe72a2cd370dbb
(aws: eu-west-2)
Test: longevity-50gb-3days-aws-i4i-test
Test id: a94220e9-cdb8-4806-9590-185b346bbad1
Test name: enterprise-2022.2/longevity/longevity-50gb-3days-aws-i4i-test
Test config file(s):
At the beginning, two stress threads were started among many others:
2022-11-10 08:45:28.520: (CassandraStressEvent Severity.NORMAL) period_type=begin event_id=6e3c6801-ef57-4c16-b225-6feef198cdab: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-3 [18.132.47.93 | 10.3.1.228] (seed: False)
stress_cmd=cassandra-stress mixed cl=QUORUM duration=4320m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=100 -pop seq=1..100000000 -log interval=5
2022-11-10 08:45:42.036: (CassandraStressEvent Severity.NORMAL) period_type=begin event_id=a34b8e2a-6b22-4646-bf4c-68770daf3ff3: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-1 [35.176.112.176 | 10.3.3.100] (seed: False)
stress_cmd=cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)' cl=QUORUM duration=4320m -mode cql3 native -rate threads=20
The test run reached the end
2022-11-13 08:57:02.971: (InfoEvent Severity.NORMAL) period_type=not-set event_id=62d39dc9-a9d2-4342-a90b-8d76bf5c2e9c: message=TEST_END
After that we got a timeout error for the stress threads:
2022-11-13 08:57:28.879: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=74e00f6b-61c4-4c9f-801e-128a19ee2464, source=LongevityTest.test_custom_time (longevity_test.LongevityTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/longevity_test.py", line 262, in test_custom_time
self.verify_stress_thread(cs_thread_pool=stress)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 1916, in verify_stress_thread
results, errors = cs_thread_pool.verify_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 273, in verify_results
for future in concurrent.futures.as_completed(self.results_futures, timeout=self.timeout):
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 241, in as_completed
raise TimeoutError(
concurrent.futures._base.TimeoutError: 1 (of 3) futures unfinished
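For context, `concurrent.futures.as_completed` raises exactly this error whenever any future is still pending when the deadline expires; a minimal standalone reproduction (not SCT code) looks like this:

```python
import concurrent.futures
import time


def hang():
    # Simulates a stress thread that is still running when the deadline hits.
    time.sleep(1)


with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(hang) for _ in range(3)]
    try:
        # Deadline of 0.1s: none of the futures can finish in time.
        for future in concurrent.futures.as_completed(futures, timeout=0.1):
            future.result()
        message = None
    except concurrent.futures.TimeoutError as exc:
        message = str(exc)

print(message)  # "3 (of 3) futures unfinished"
```

So the "1 (of 3) futures unfinished" message above simply means one stress thread never returned before `self.timeout` elapsed; it says nothing about why that thread hung.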
All the other stress threads finished, but the two mentioned above failed:
2022-11-13 08:57:31.945: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=a34b8e2a-6b22-4646-bf4c-68770daf3ff3 duration=3d0h11m49s: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-1 [35.176.112.176 | 10.3.3.100] (seed: False)
stress_cmd=cassandra-stress user profile=/tmp/cs_mv_profile.yaml ops'(insert=3,read1=1,read2=1,read3=1)' cl=QUORUM duration=4320m -mode cql3 native -rate threads=20
errors:
Stress command completed with bad status 1: WARN 08:57:16,007 Error creating netty channel to /10.3.0.176:9042 com.datastax.shaded.netty.channe
2022-11-13 08:57:38.326: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=6e3c6801-ef57-4c16-b225-6feef198cdab duration=3d0h12m9s: node=Node longevity-tls-50gb-3d-2022-2-loader-node-a94220e9-3 [18.132.47.93 | 10.3.1.228] (seed: False) stress_cmd=cassandra-stress mixed cl=QUORUM duration=4320m -schema 'replication(factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=100 -pop seq=1..100000000 -log interval=5 errors:
Stress command completed with bad status 1: WARN 08:57:23,897 Error creating netty channel to /10.3.0.170:9042 com.datastax.shaded.netty.channe
- Restore Monitor Stack command: `$ hydra investigate show-monitor a94220e9-cdb8-4806-9590-185b346bbad1`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=a94220e9-cdb8-4806-9590-185b346bbad1)
- Show all stored logs command: `$ hydra investigate show-logs a94220e9-cdb8-4806-9590-185b346bbad1`
## Logs:
- **db-cluster-a94220e9.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/db-cluster-a94220e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/db-cluster-a94220e9.tar.gz)
- **monitor-set-a94220e9.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/monitor-set-a94220e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/monitor-set-a94220e9.tar.gz)
- **loader-set-a94220e9.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/loader-set-a94220e9.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/loader-set-a94220e9.tar.gz)
- **normal-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/normal-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/normal-a94220e9.log.tar.gz)
- **summary-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/summary-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/summary-a94220e9.log.tar.gz)
- **events-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/events-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/events-a94220e9.log.tar.gz)
- **output-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/output-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/output-a94220e9.log.tar.gz)
- **debug-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/debug-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/debug-a94220e9.log.tar.gz)
- **sct-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/sct-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/sct-a94220e9.log.tar.gz)
- **error-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/error-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/error-a94220e9.log.tar.gz)
- **critical-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/critical-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/critical-a94220e9.log.tar.gz)
- **raw_events-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/raw_events-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/raw_events-a94220e9.log.tar.gz)
- **warning-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/warning-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/warning-a94220e9.log.tar.gz)
- **email_data-a94220e9.json.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/email_data-a94220e9.json.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/email_data-a94220e9.json.tar.gz)
- **argus-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/argus-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/argus-a94220e9.log.tar.gz)
- **left_processes-a94220e9.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/left_processes-a94220e9.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/a94220e9-cdb8-4806-9590-185b346bbad1/20221113_091231/left_processes-a94220e9.log.tar.gz)
[Jenkins job URL](https://jenkins.scylladb.com/job/enterprise-2022.2/job/longevity/job/longevity-50gb-3days-aws-i4i-test/11/)
we are most likely seeing something similar in a rolling upgrade test:
Kernel Version: 3.10.0-1160.76.1.el7.x86_64
Scylla version (or git commit hash): 5.0.5-20221009.5a97a1060
with build-id 5009658b834aaf68970135bfc84f964b66ea4dee
Relocatable Package: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.0/scylla-x86_64-package-5.0.5.0.20221009.5a97a1060.tar.gz
Cluster size: 4 nodes (n1-highmem-8)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/centos-cloud/global/images/family/centos-7
(gce: us-east1)
Test: rolling-upgrade-centos7-test
Test id: edcca94b-e7eb-4751-ba67-513d9f6cc34a
Test name: scylla-5.1/rolling-upgrade/rolling-upgrade-centos7-test
Test config file(s):
>>>>>>>
The test timed out during step 5 (out of 10), because these steps took some 30 minutes longer than they usually do (we run c-s and gemini stress multiple times in this test). There are no errors or stress command failures, but things seem to be delayed.
<<<<<<<
$ hydra investigate show-monitor edcca94b-e7eb-4751-ba67-513d9f6cc34a
$ hydra investigate show-logs edcca94b-e7eb-4751-ba67-513d9f6cc34a
Same issue reproduced on Azure:
some cassandra-stress processes continue trying to connect to a removed node after c-s has reported results, which causes this error:
2023-11-12 20:18:50.601: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=3b624bf0-6e0b-4a6a-a174-131a8d2aabc7, source=LongevityTest.test_custom_time (longevity_test.LongevityTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/longevity_test.py", line 166, in test_custom_time
self.verify_stress_thread(cs_thread_pool=stress)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 2165, in verify_stress_thread
results, errors = cs_thread_pool.verify_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 379, in verify_results
results = super().get_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress/base.py", line 88, in get_results
for future in concurrent.futures.as_completed(self.results_futures, timeout=timeout):
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 241, in as_completed
raise TimeoutError(
concurrent.futures._base.TimeoutError: 2 (of 2) futures unfinished
The test is then marked as failed with an error or critical error.
Latest 3 runs on Azure:
Kernel Version: 5.15.0-1051-azure
Scylla version (or git commit hash): 5.5.0~dev-20231112.8d618bbfc657
with build-id d2846e72c5ad6493e5eb93f1eb9227eca2761f92
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.5.0-dev-x86_64-2023-11-12T14-55-14
(azure: undefined_region)
Test: longevity-10gb-3h-azure-test
Test id: 5be183a7-473d-45d5-a1c5-fb8308c56f5c
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):
@aleksbykov
What are the loader logs showing?
I don't think that it's related to this year old issue
@fruch @aleksbykov
I can agree with Israel. This error looks related to https://github.com/scylladb/java-driver/issues/258 :
The c-s failed to stop after DecommissionStreamingErr nemesis.
Installation details
Kernel Version: 5.15.0-1020-aws
Scylla version (or git commit hash): 5.2.0~dev-20221002.060dda8e00b7
with build-id b4e08d869feb2fecb04c6ea45eb8946cade10c70
Cluster size: 6 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-014a8e66eac5a37c5
(aws: us-east-1)
Test: longevity-50gb-3days
Test id: ee8833a4-d087-47f6-9865-3db37c29f94b
Test name: scylla-master/longevity/longevity-50gb-3days
Test config file(s):
Issue description
The test times out at 15:13 while waiting for the stress threads to finish:
while most of the main stress threads finished, one still fails on the timeout.
Looking at the log, it seems to keep trying to contact a non-existing node, which slows it down a bit.
$ hydra investigate show-monitor ee8833a4-d087-47f6-9865-3db37c29f94b
$ hydra investigate show-logs ee8833a4-d087-47f6-9865-3db37c29f94b
Logs:
Jenkins job URL