fruch opened this issue 11 months ago
Got a similar problem after a DecommissionStreamingErr nemesis during longevity-10gb-3h-azure-test (10.0.0.5 is the target node for this nemesis):
TAG: loader_idx:1-cpu_idx:0-keyspace_idx:1
08:37:19,314 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback-tools.xml] at [file:/etc/scylla/cassandra/logback-tools.xml]
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs multiple times on the classpath.
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs at [file:/logback-tools.xml]
08:37:19,315 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback-tools.xml] occurs at [file:/etc/scylla/cassandra/logback-tools.xml]
08:37:19,412 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
08:37:19,413 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
08:37:19,420 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDERR]
08:37:19,428 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
08:37:19,489 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to WARN
08:37:19,489 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDERR] to Logger[ROOT]
08:37:19,489 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
08:37:19,490 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@2781e022 - Registering current configuration as safe fallback point
******************** Stress Settings ********************
Command:
Type: write
Count: -1
Duration: 180 MINUTES
...
===== Using optimized driver!!! =====
Connected to cluster: longevity-10gb-3h-master-db-cluster-6c30fd46, max pending requests per connection null, max connections per host 8
Datatacenter: eastus; Host: /10.0.0.5; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.6; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.7; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.8; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.9; Rack: eastus
Datatacenter: eastus; Host: /10.0.0.10; Rack: eastus
Created keyspaces. Sleeping 6s for propagation.
Sleeping 2s...
Running WRITE with 1000 threads 180 minutes
type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb
Failed to connect over JMX; not collecting these stats
total, 132151, 26430, 26430, 26430, 33.3, 19.7, 109.8, 185.7, 335.8, 459.8, 5.0, 0.00000, 0, 0, 0, 0, 0, 0
total, 366429, 46856, 46856, 46856, 21.3, 17.7, 51.7, 73.6, 110.0, 154.9, 10.0, 0.17570, 0, 0, 0, 0, 0, 0
...
total, 662360477, 78539, 78539, 78539, 12.7, 6.7, 45.6, 75.8, 114.6, 161.6,10800.0, 0.00426, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
total, 662377070, 63469, 63469, 63469, 14.8, 9.3, 46.0, 68.8, 89.3, 100.1,10800.3, 0.00427, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Results:
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Op rate : 61,330 op/s [WRITE: 61,330 op/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Partition rate : 61,330 pk/s [WRITE: 61,330 pk/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Row rate : 61,330 row/s [WRITE: 61,330 row/s]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency mean : 16.3 ms [WRITE: 16.3 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency median : 6.8 ms [WRITE: 6.8 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 95th percentile : 61.4 ms [WRITE: 61.4 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 99th percentile : 149.6 ms [WRITE: 149.6 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency 99.9th percentile : 488.1 ms [WRITE: 488.1 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Latency max : 42312.1 ms [WRITE: 42,312.1 ms]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total partitions : 662,377,070 [WRITE: 662,377,070]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total errors : 0 [WRITE: 0]
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC count : 0
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC memory : 0.000 KiB
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total GC time : 0.0 seconds
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Avg GC time : NaN ms
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
StdDev GC time : 0.0 ms
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
Total operation time : 03:00:00
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
END
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
...
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Connection has been closed
...
com.datastax.driver.core.exceptions.OperationTimedOutException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Timed out waiting for server response
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-master-db-node-eastus-1.internal.cloudapp.net/10.0.0.5:9042] Connection has been closed
Kernel Version: 5.15.0-1051-azure
Scylla version (or git commit hash): 5.5.0~dev-20231118.6bcf3ac86c64
with build-id 61307c96d481b470546e0fd8d94a69a544c4f05c
Cluster size: 6 nodes (Standard_L8s_v3)
Scylla Nodes used in this run:
OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-5.5.0-dev-x86_64-2023-11-18T06-26-15
(azure: undefined_region)
Test: longevity-10gb-3h-azure-test
Test id: 6c30fd46-3432-426f-824f-362eb6107b4b
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test config file(s):
The job is failing like this at least since build 244.
@avelanarius - is anyone looking at this?
Currently no one, with other tasks taking priority.
cassandra-stress hangs after finishing its stress work:
total, 527518456, 55284, 55284, 55284, 18.3, 12.6, 55.1, 83.6, 105.1, 116.9,10800.9, 0.11788, 0, 0, 0, 0, 0, 0
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Results:
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Op rate : 48,840 op/s [WRITE: 48,840 op/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Partition rate : 48,840 pk/s [WRITE: 48,840 pk/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Row rate : 48,840 row/s [WRITE: 48,840 row/s]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency mean : 20.3 ms [WRITE: 20.3 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency median : 12.7 ms [WRITE: 12.7 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 95th percentile : 55.2 ms [WRITE: 55.2 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 99th percentile : 87.8 ms [WRITE: 87.8 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency 99.9th percentile : 169.2 ms [WRITE: 169.2 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Latency max : 90664.1 ms [WRITE: 90,664.1 ms]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total partitions : 527,518,456 [WRITE: 527,518,456]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total errors : 0 [WRITE: 0]
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC count : 0
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC memory : 0.000 KiB
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total GC time : 0.0 seconds
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Avg GC time : NaN ms
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
StdDev GC time : 0.0 ms
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
Total operation time : 03:00:00
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
END
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [/10.142.0.121:9042] Connection has been closed
...
Kernel Version: 5.15.0-1048-gcp
Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3
with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c
Cluster size: 6 nodes (n2-highmem-16)
Scylla Nodes used in this run:
OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/2925864426790911027
(gce: undefined_region)
Test: longevity-10gb-3h-gce-test
Test id: f4ac5c0d-4243-4afe-9c28-b9e5fe051b92
Test name: enterprise-2024.1/longevity/longevity-10gb-3h-gce-test
Test config file(s):
One more repro here:
Kernel Version: 5.15.0-1045-aws
Scylla version (or git commit hash): 2023.1.2-20231001.646df23cc4b3
with build-id 367fcf1672d44f5cbddc88f946cf272e2551b85a
Target Scylla version (or git commit hash): 2024.1.0~rc2-20231217.f57117d9cfe3
with build-id 3a4d2dfe8ef4eef5454badb34d1710a5f36a859c
Cluster size: 6 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0bca3ed0b1ef2f2fa ami-001b0989b7cf54ded
(aws: undefined_region)
Test: rolling-upgrade-multidc-test
Test id: 66b220bf-623f-460a-bd36-d6fa1dc496ba
Test name: enterprise-2024.1/rolling-upgrade/rolling-upgrade-multidc-test
Test config file(s):
The last report from @vponomaryov is something that is reproducible; running it now with c-s debug enabled.
@avelanarius / @Bouncheck let's try to at least triage it ASAP. We have quite an easy way to reproduce it; let's try to understand where the issue is and whether it's a recent regression or not.
I haven't understood the root cause yet; however, I have a modification in mind that may help ensure that the shutdown happens. I'll consult @avelanarius first.
I relaunched the Jenkins reproducer with DEBUG logs: https://jenkins.scylladb.com/job/enterprise-2024.1/job/reproducers/job/rolling-upgrade-multidc-test/5/. The run mentioned here:
> The last report from @vponomaryov is something that is reproducible; running it now with c-s debug enabled: https://argus.scylladb.com/test/7748483c-dd44-4306-9e69-97c97214f482/runs?additionalRuns[]=a242eed1-bb86-48e4-b27b-aec40cbb23db
didn't yield useful logs (we're not even sure that the actual problem reproduced...)
@Bouncheck I see that the latest run reproduced the problem correctly, and now there are good DEBUG logs to try to understand the situation - file cassandra-stress-read-l0-c0-k1-2494aaab-4173-4ff4-81e0-1b805462a533.log.
And as discussed with @roydahan, please try to arrange a quick call with @fruch to make sure we're on the same page (and that we know how to conveniently run the reproducer locally, etc.).
(And as I understand it, even though this is not a P0, it affects a lot of Enterprise CI jobs, so it could become urgent soon.)
@Bouncheck any progress on this one?
Nothing definitive yet. I'm currently considering two possibilities: something is not being closed properly, and/or waiting for all writes to finish (with failures) after initiating a graceful shutdown simply takes too long.
In the logs of the first example I've seen that one of the cassandra-stress runs never outputs an "END" or "FAILURE" string at any point, even though the logged times go past the set duration of the stress command. In the same run there is also a single NPE, which happens when the driver tries to close one connection pool at the very start. It looks like a race condition, because the code it points to checks for null before calling the method. I don't yet know whether it causes anything to hang or is benign.
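For illustration only, here is a minimal, self-contained sketch (not the actual driver code; the type names are made up) of the check-then-act pattern described above: the null check and the call are not atomic, so a concurrent writer can clear the reference in between, and the "guarded" call can still throw an NPE.

```java
public class CheckThenActRace {
    interface Pool { void closeAsync(); }           // hypothetical stand-in for a connection pool

    private static volatile Pool pool = () -> { };  // shared, mutable reference

    public static void main(String[] args) throws InterruptedException {
        Thread closer = new Thread(() -> {
            if (pool != null) {       // check
                // another thread may set 'pool' to null right here...
                pool.closeAsync();    // ...so this second read can throw NullPointerException
            }
        });
        Thread remover = new Thread(() -> pool = null);
        closer.start();
        remover.start();
        closer.join();
        remover.join();
        // the usual fix is to read once into a local: Pool p = pool; if (p != null) p.closeAsync();
    }
}
```

Whether the driver's actual NPE follows exactly this shape is an assumption; the sketch only shows why a plain null check is not enough under concurrency.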
In the logs of the second example I can see that all cassandra-stress runs technically end, but there's a lot of driver spam even after that. This is probably because all writes started before the shutdown need to finish. In that specific case it's hard to tell whether it happens within a reasonable timeframe, due to the lack of timestamps. In some logs I've seen that the shutdown can sometimes take several (e.g. 5) minutes.
In the logs of the last run I can see, in the Events tab in Argus, that the test failed due to a timeout:
2024-01-08 19:47:17.500: (TestTimeoutEvent Severity.CRITICAL) period_type=not-set event_id=6b125f8c-60e1-45dc-ad15-6c21b0d15b84, Test started at 2024-01-08 13:47:17, reached it's timeout (360 minute)
and, it seems, even a c-s related error:
2024-01-08 19:48:09.758: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=c15c384e-18fb-4a72-acde-492d399ec0d1, source=UpgradeTest.test_generic_cluster_upgrade (upgrade_test.UpgradeTest)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 985, in test_generic_cluster_upgrade
self._run_stress_workload("stress_after_cluster_upgrade", wait_for_finish=True)
File "/home/ubuntu/scylla-cluster-tests/upgrade_test.py", line 911, in _run_stress_workload
self.verify_stress_thread(thread_pool)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 2091, in verify_stress_thread
results, errors = cs_thread_pool.verify_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress_thread.py", line 379, in verify_results
results = super().get_results()
File "/home/ubuntu/scylla-cluster-tests/sdcm/stress/base.py", line 88, in get_results
for future in concurrent.futures.as_completed(self.results_futures, timeout=timeout):
File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 245, in as_completed
waiter.event.wait(wait_timeout)
File "/usr/local/lib/python3.10/threading.py", line 607, in wait
signaled = self._cond.wait(timeout)
File "/usr/local/lib/python3.10/threading.py", line 324, in wait
gotit = waiter.acquire(True, timeout)
File "/home/ubuntu/scylla-cluster-tests/sdcm/tester.py", line 263, in critical_failure_handler
raise CriticalTestFailure("Critical Error has failed the test") # pylint: disable=raise-missing-from
sdcm.tester.CriticalTestFailure: Critical Error has failed the test
But all c-s logs from the loader set do have an "END" message, and the timestamps do not match the timeout. There is one suspicious c-s log (cassandra-stress-read-l1-c0-k1-bffede14-8924-4da1-9644-3876601c74e9.log) which does not end with the "freed threads" lines. It looks like it ends abruptly, but its last timestamped line has timestamp 16:10:57,050 while the timeout reported in Argus happens at 19:47:17.500. I am assuming the clocks are not off by more than an hour, because the earliest stress logs I've found were within minutes of the start of the whole test. The times also match the reads and writes charts.
I am not sure how to explain this one.
@fruch / @Bouncheck I think that longevity-10gb-3h-gce-test is one of the jobs that reproduces it consistently. @fruch, maybe we can try to reproduce it with a shorter duration.
A tcpdump of the entire test is not feasible in terms of size, so we need to find the correct spot to start and end it.
To expedite fixing the problem, @Bouncheck will try:
- Rerunning the job with higher timeout (rather than the current +5%)
this can be changed in: https://github.com/scylladb/scylla-cluster-tests/blob/master/sdcm/stress_thread.py#L365
- Maybe adding even more logging (with custom driver)
Supply us with a fork/branch with those, and we can build a docker image that we can use for reproducers of this issue.
- Maybe running tcpdump (in one run we saw one connection still having a lot of inflight requests which we don't know where they originated from)
- Maybe running many runs in parallel to make sure our iteration time is faster and we hit the problem more often
The cases where it happens the most are longevity-10gb-3h-azure-test and longevity-10gb-3h-gce-test.
I don't mind changing the c-s image in those two until we can figure this out.
@roydahan fyi: @Bouncheck was/is in contact with @fruch to figure out the problem and ask him about SCT/this issue
I have trouble understanding some of the logs.
For example in this run: https://jenkins.scylladb.com/job/scylla-5.4/job/longevity/job/longevity-10gb-3h-gce-test/9/
In the loader-set-3e9f3337/cassandra-stress-write-l0-c0-k1-d967395a-8701-4848-9d4c-5225146c1a96.log
I can see the following lines:
19:51:31,810 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@76a3e297 - Registering current configuration as safe fallback point
******************** Stress Settings ********************
Command:
Type: write
Count: -1
Duration: 180 MINUTES
No Warmup: true
Consistency Level: QUORUM
[...]
WARN 19:53:44,446 Error creating netty channel to longevity-10gb-3h-5-4-db-node-3e9f3337-0-6.c.sct-project-1.internal/10.142.0.108:9042
[...]
Total operation time : 03:00:00
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
END
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
com.datastax.driver.core.exceptions.TransportException: [longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042] Connection has been closed
WARN 20:42:09,958 Error creating netty channel to longevity-10gb-3h-5-4-db-node-3e9f3337-0-1.c.sct-project-1.internal/10.142.0.82:9042
[...] stands in for all the in-between logs. What confuses me about these logs is that the stress command, which seems to be configured for 180 minutes and reports a total running time of 3h at the end, is enclosed between timestamps that are less than 1h apart, assuming the timestamps are correct. Is the total running time something different from how long the command runs? Is there an easy way to confirm at what times the cluster was under load?
Yes, downloading the monitoring data and looking at the metrics would help with that:
`hydra investigate show-monitor f4ac5c0d-4243-4afe-9c28-b9e5fe051b92`
You just need to find the relevant test_id for the run you want to check out.
@Bouncheck have you made any progress on that? Next week could be a good opportunity to sit together with someone from @fruch's team (e.g. @soyacz) and try debugging it together.
Not really, but I'm not out of options yet. Sounds like a good idea, I'll reach out.
@dkropachev any idea how to push this forward? It's happening on an almost daily basis on the SCT short longevities.
@fruch, we will take care of it next week.
Reproduced with 2024.3.0~dev-20241007
Scylla version: 2024.3.0~dev-20241007.887201968d48
with build-id aadf81f92d9cf1e67f821457d46de618e97c8969
Kernel Version: 6.8.0-1016-aws
Cluster size: 3 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-005d075bf3a6cdfd3
(aws: undefined_region)
Test: scylla-enterprise-perf-regression-predefined-throughput-steps-vnodes
Test id: 4a856b2e-8a6c-4195-93c3-b7ef2881fb1f
Test name: scylla-enterprise/perf-regression/scylla-enterprise-perf-regression-predefined-throughput-steps-vnodes
Test method: performance_regression_gradual_grow_throughput.PerformanceRegressionPredefinedStepsTest.test_mixed_gradual_increase_load
Test config file(s):
@Bouncheck do we have any progress with this issue? (I think we have 2 additional ones assigned to you that we believe are caused by the same root cause.)
I've found that the variation where c-s stops producing load is reproducible locally. The variation where c-s fails to shut down correctly and hangs indefinitely I haven't managed to reproduce locally with the dtest that was reported to fail.
I'm proceeding with trying to solve the first variation, but I'm not sure the root cause will be the same. One of the scenarios under which writes stop is a situation where the cluster gets overloaded, causing writes to fail 10 times (where 10 is the default number of retries). The conditions are that 1 of 3 nodes is down and queries require quorum consistency, but the quorum is still alive. Even if the third node is brought back, the driver will reconnect to it, but c-s will still not produce queries, so I'm looking at c-s itself to see whether there is a flaw in its error handling.
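To make the suspected flaw concrete, here is a simplified sketch — my own assumption about the shape of the logic, not cassandra-stress internals — of how a worker loop that gives up once an operation has exhausted its retries never resumes producing load, even after the cluster recovers:

```java
import java.util.concurrent.ThreadLocalRandom;

public class RetryExhaustionSketch {
    static final int MAX_RETRIES = 10;   // mirrors the default of 10 retries mentioned above

    // stand-in for a single quorum write; fails most of the time while the cluster is overloaded
    static boolean tryWrite() {
        return ThreadLocalRandom.current().nextInt(100) >= 95;
    }

    public static void main(String[] args) {
        while (true) {
            boolean ok = false;
            for (int attempt = 0; attempt < MAX_RETRIES && !ok; attempt++) {
                ok = tryWrite();
            }
            if (!ok) {
                // If error handling terminates the worker here (instead of reporting the
                // failure and continuing), load production never recovers, even after the
                // downed node comes back and the driver reconnects to it.
                System.err.println("write failed after " + MAX_RETRIES + " retries; worker stops");
                return;
            }
            System.out.println("write ok");
        }
    }
}
```

This is only meant to illustrate the failure mode being investigated; where exactly (if anywhere) c-s drops the worker is still the open question.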
I think this one is not actually caused by the java-driver but by something on the SCT side, unless I'm missing something. This would also explain why it seems impossible to reproduce locally.
Since I couldn't reproduce it, I've been working with a copy of longevity-10gb-3h-gce-test on Jenkins, which seems to fail consistently.
Every time I manually inspected the loader nodes, the stress process inside the docker container would end successfully without much delay. The process would no longer be listed within a minute past the duration when running `jps`, `ps aux` as root, or `top`. I've added an echo of the exit code right after the stress command and it would output 0 to a file every time.
In the runs that fail, the docker container remains alive after c-s finishes, while when the test ends successfully the docker container is gone rather quickly.
I'm not sure what component fails there (if any). I'm working on pinpointing the cause.
I tried to investigate with @Bouncheck and we came to the following conclusions:
In run 4f4905a4-35a0-4b29-b32b-e5b9b4a1eb95 we can see that SCT got log lines like:
< t:2024-10-27 10:32:01,796 f:base.py l:231 c:RemoteLibSSH2CmdRunner p:DEBUG > <10.142.0.60>: WARN 10:21:19,793 Error creating netty channel to longevity-10gb-3h-6-2-modi-db-node-4f4905a4-0-5.c.sct-project-1.internal/10.142.0.55:9042
It also looks like c-s printed these errors after the test end, as the c-s logs from the loader also look delayed (a log from 10:04 came after the END message, while c-s started around 10:02). We need to investigate this issue in the SCT code.
One of my culprit candidates is that there's too much load on the SCT runner and it can't keep up with log collection in real time. Basically, with Python bound to 1 CPU, on a 4-core machine we can reach at most 25% - and we are above that (maybe a bit higher than 25% is possible because we use more sub-processes for log parsing). In other tests the sct-runner is mostly idle. Example from the sanity 4h test:
@soyacz
All of this sounds to me like we should have log processing in some perf-oriented language like Rust, or in Go, because this is not the first problem related to the high load of log processing in Python.
So, first step: look for existing solutions; second: try them out; and third, probably, write our own if none of the existing ones suits our needs.
@fruch, @roydahan WDYT?
I'm not yet convinced it's the processing at the end of the test; since we are running it via the docker command line over ssh, that might slow us down as well.
Anyhow, we don't expect the stress tool to be printing errors at that high a rate; it doesn't help us at all.
Also, once we are done with the stress and are stopping the session, we don't care about any of those errors anymore.
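As a rough illustration of that last point - a sketch of my own, not something c-s or SCT currently does - the client could remember that shutdown has started and drop per-operation error reporting from that moment on, instead of flooding the output:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownAwareErrorReporting {
    // flipped right before session/cluster close begins
    private static final AtomicBoolean shuttingDown = new AtomicBoolean(false);

    static void reportError(Throwable t) {
        if (shuttingDown.get()) {
            return;                 // shutdown in progress: connection errors are expected noise
        }
        System.err.println(t);      // normal path: surface the error
    }

    public static void main(String[] args) {
        reportError(new RuntimeException("real error during the run - printed"));
        shuttingDown.set(true);     // e.g. set just before closing the session
        reportError(new RuntimeException("late error during shutdown - suppressed"));
    }
}
```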
Issue description
At the end of the stress command one node is gone (10.0.1.17), and the driver shutdown seems to be stuck.
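One possible mitigation for such a stuck shutdown - a minimal sketch, assuming the DataStax Java driver 3.x API (`Cluster.closeAsync()` returning a `CloseFuture` that supports `force()`), not what c-s does today - is to bound the graceful close and force it if the node that disappeared never answers:

```java
import com.datastax.driver.core.CloseFuture;
import com.datastax.driver.core.Cluster;

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedShutdown {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.1.17")         // example contact point from this report
                .build();
        try {
            // ... run the workload ...
        } finally {
            CloseFuture close = cluster.closeAsync(); // start a graceful shutdown
            try {
                close.get(2, TimeUnit.MINUTES);       // wait a bounded amount of time
            } catch (TimeoutException e) {
                close.force();                        // abandon in-flight requests and close now
            }
        }
    }
}
```

Whether forcing the close is acceptable for the final stats c-s prints is a separate question; the sketch only shows how to avoid hanging indefinitely.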
Impact
Stress command is getting stuck
How frequently does it reproduce?
seen once so far on the nightly runs
Installation details
Kernel Version: 5.15.0-1049-aws
Scylla version (or git commit hash): 5.5.0~dev-20231113.7b08886e8dd8
with build-id 7548c48606c5b58f282fc2b596019226de0df6ed
Cluster size: 3 nodes (i4i.large)
Scylla Nodes used in this run:
OS / Image: ami-02ea62bee686ef03b
(aws: undefined_region)
Test: longevity-5gb-1h-NodeTerminateAndReplace-aws-test
Test id: dfb19859-9fbb-4408-a2be-a42c7400f7cb
Test name: scylla-master/nemesis/longevity-5gb-1h-NodeTerminateAndReplace-aws-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor dfb19859-9fbb-4408-a2be-a42c7400f7cb`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=dfb19859-9fbb-4408-a2be-a42c7400f7cb)
- Show all stored logs command: `$ hydra investigate show-logs dfb19859-9fbb-4408-a2be-a42c7400f7cb`

## Logs:
- **db-cluster-dfb19859.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/db-cluster-dfb19859.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/db-cluster-dfb19859.tar.gz)
- **sct-runner-events-dfb19859.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/sct-runner-events-dfb19859.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/sct-runner-events-dfb19859.tar.gz)
- **sct-dfb19859.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/sct-dfb19859.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/sct-dfb19859.log.tar.gz)
- **loader-set-dfb19859.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/loader-set-dfb19859.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/loader-set-dfb19859.tar.gz)
- **monitor-set-dfb19859.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/monitor-set-dfb19859.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/dfb19859-9fbb-4408-a2be-a42c7400f7cb/20231114_120128/monitor-set-dfb19859.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/nemesis/job/longevity-5gb-1h-NodeTerminateAndReplace-aws-test/83/)
[Argus](https://argus.scylladb.com/test/0af3b0dc-d7dd-4a83-b906-8fa3ebb0e676/runs?additionalRuns[]=dfb19859-9fbb-4408-a2be-a42c7400f7cb)