@yarongilor
Seriously? Please supply complete information, and links to jobs and logs.
No one would look into an issue like this one....
@fruch , you're responding too fast... I'm still in the middle of updating it, waiting for SCT to complete log collection etc. Thanks for noticing anyway.
What do you mean by disconnected?
Can you add the relevant log lines showing this disconnection?
@fruch , disconnected means not sshable, i.e. no network; perhaps the network daemon is down, perhaps various side effects of a memory leak. So when that disconnection happened, the scylla-bench running on loader-1 couldn't reach any cluster node and kept failing with permanent errors:
23h3m16.3s 1838 3611476 0 2.1s 788ms 590ms 471ms 395ms 230ms 251ms
2022/12/01 10:53:44 EOF
2022/12/01 10:59:06 EOF
2022/12/01 11:01:31 EOF
2022/12/01 11:01:31 EOF
2022/12/01 11:01:32 EOF
2022/12/01 11:01:32 EOF
2022/12/01 11:01:32 EOF
2022/12/01 11:01:32 writev tcp 10.4.2.12:47131->10.4.0.122:9042: writev: broken pipe
2022/12/01 11:01:32 writev tcp 10.4.2.12:58521->10.4.0.122:9042: writev: connection timed out
2022/12/01 11:01:33 writev tcp 10.4.2.12:47131->10.4.0.122:9042: writev: broken pipe
2022/12/01 11:01:33 EOF
2022/12/01 11:01:33 EOF
2022/12/01 11:01:34 writev tcp 10.4.2.12:47131->10.4.0.122:9042: writev: broken pipe
2022/12/01 11:01:34 EOF
2022/12/01 11:01:34 EOF
2022/12/01 11:01:34 writev tcp 10.4.2.12:47131->10.4.0.122:9042: writev: broken pipe
2022/12/01 11:01:34 writev tcp 10.4.2.12:60321->10.4.0.222:9042: writev: connection timed out
2022/12/01 11:01:35 writev tcp 10.4.2.12:42561->10.4.3.75:9042: writev: connection timed out
2022/12/01 11:01:35 EOF
The last report of loader-1 found in the SCT log is:
< t:2022-12-01 08:37:08,354 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > 2022-12-01 08:37:08.353: (CassandraStressEvent Severity.NORMAL) period_type=end event_id=1b22c7e7-4308-44da-945c-85b18e4b8123 duration=29s: node=Node longevity-large-partitions-order-by-loader-node-10079e1e-1 [34.245.153.178 | 10.4.2.12] (seed: False)
mode=read -select-order-by=desc
Did you look at the monitor for the loader's memory usage, to see that memory leak, if any?
What do the loader's messages logs say? If the network daemon was down or crashed, it would show up there.
Is this reproducible?
@fruch , the issue wasn't reproduced using a bigger loader instance with double the memory size ( instance_type_loader: 'c5n.4xlarge' ).
your initial report here doesn't look like a memory leak
@fruch , I don't know what this graph is; perhaps it's the monitor node and not the loader.
In any case, we can also try to see if we get a shorter reproducer when using a smaller instance type for the loader.
@yarongilor is right, the provided graph is for the monitoring node. Proof: the graph is for the node with the 10.4.2.28 IP address, and that address belongs to the monitoring node:
11:45:11 < t:2022-11-30 09:45:11,445 f:cluster.py l:495 c:sdcm.cluster_aws p:INFO > Node longevity-large-partitions-order-by-monitor-node-10079e1e-1 [3.250.82.140 | 10.4.2.28] (seed: False): Detected Linux distribution: CENTOS7
Also, I can add that sometime in the summer I observed the problem described here. When the problem appears, even a restart of the node doesn't help; network access is still lost. In my case it appeared only with a wrong configuration for s-b, such as reading non-existing rows/partitions, but I assume that was only speeding up the appearance of the problem. So, s-b does something to the machine after which networking/SSH can never come up again.
As far as I know, there's probably a kernel panic, which renders the machine useless (i.e. no network, no SSH, and no traffic) until it's rebooted; the question is what's the cause of the kernel panic.
If we are leaking memory, we need some proof of that.
We are collecting it with collectd (and not with node exporter); this is the query that would show memory consumption:
sum(collectd_memory{memory!="used"}) by (instance) / sum(collectd_memory{memory=~"..*"}) by (instance) * 100
and yes, this looks like a memory leak.
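(For anyone who wants to evaluate that query outside Grafana: a minimal Go sketch against the Prometheus HTTP API. The 10.4.2.28:9090 address is an assumption here; substitute the actual SCT monitor node.)

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Assumed monitor address and default Prometheus port; substitute the real ones.
	base := "http://10.4.2.28:9090/api/v1/query"
	// The memory-consumption query quoted above.
	q := `sum(collectd_memory{memory!="used"}) by (instance) / sum(collectd_memory{memory=~"..*"}) by (instance) * 100`
	resp, err := http.Get(base + "?query=" + url.QueryEscape(q))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	// Prints Prometheus' JSON response, e.g. {"status":"success","data":{...}}
	fmt.Println(string(body))
}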
@michoecho in case it's related to your woes with s-b
@vponomaryov are you trying to look at it? (I don't know if you have the capacity.) We should probably run a profiler (pprof?), so if you don't have much time, you can ask @yarongilor to run the reproducer and the profiler for a few minutes.
A good hint is that it probably happens only with the reversed-query implementation (or at least it's more severe there).
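(scylla-bench is written in Go, so one way to profile it is to expose the standard net/http/pprof handlers in a local build. A minimal sketch, assuming such an endpoint is added; scylla-bench does not necessarily ship with one built in:)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the profiler on a side port while the load generator runs.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... the tool's normal benchmarking work would run here ...
	select {} // placeholder to keep the process alive in this sketch
}

With that running under load, "go tool pprof http://localhost:6060/debug/pprof/heap" fetches a heap snapshot for inspection.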
I played with pprof a bit yesterday and it didn't show anything useful from that brief attempt. So, a deeper dive is needed.
@piodul would you be able to take a look at this issue?
I found a memory leak in scylla-bench and posted a fix for it: https://github.com/scylladb/scylla-bench/pull/117 . It doesn't seem related to reverse queries, but the behavior looks similar: memory usage fluctuates but it tends to grow. If this issue is reproducible in SCT, then it might be worth running the same test after the fix is merged.
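(For context, a generic Go illustration of the kind of goroutine/channel leak such a fix addresses. This is a sketch with a hypothetical doQuery stand-in, not the actual scylla-bench code or the contents of that PR.)

package main

import "time"

// Leaky pattern: the worker sends its result on an unbuffered channel, but
// when the caller times out and returns, nobody ever receives, so the
// goroutine (and everything it references) stays alive forever.
func queryWithTimeout() (int, bool) {
	ch := make(chan int) // unbuffered: the send blocks until received
	go func() {
		ch <- doQuery() // blocks forever if the caller already gave up
	}()
	select {
	case r := <-ch:
		return r, true
	case <-time.After(time.Second):
		return 0, false // the goroutine above is now leaked
	}
}

// Fix: a buffered channel lets the send complete even with no receiver,
// so the goroutine can exit and be garbage collected.
func queryWithTimeoutFixed() (int, bool) {
	ch := make(chan int, 1)
	go func() { ch <- doQuery() }()
	select {
	case r := <-ch:
		return r, true
	case <-time.After(time.Second):
		return 0, false
	}
}

func doQuery() int { return 42 } // hypothetical stand-in for real driver work

func main() {
	_, _ = queryWithTimeout()
	_, _ = queryWithTimeoutFixed()
}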
Looks like the memory leak is not reproduced after ~2 days. The tested fix is in the scylla-bench image: scylladb/hydra-loaders:scylla-bench-112-fix-channel-leak
Installation details
Kernel Version: 5.15.0-1023-aws
Scylla version (or git commit hash): 2022.2.0~rc5-20221121.feb292600fc4 with build-id 172f9538efb0893c97c86cdf05622925159f4fa2
Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0b77b476432e37d90 (aws: eu-west-1)
Test: longevity-large-partition-4days-test-rq2
Test id: 10079e1e-18ba-4ffc-a6b0-e4a399b7d23c
Test name: scylla-staging/Longevity_yaron/longevity-large-partition-4days-test-rq2
Test config file(s):
Issue description
>>>>>>> Scenario: run a new longevity test where scylla-bench uses the parameter select-order-by=desc in order to test reversed queries. But for some reason (perhaps an update after a rebase or something) it actually created the table with 'desc', so the branch and configuration have to be rechecked. (The file configurations/scylla-bench-stress-order-by-desc.yaml had 'desc', which overrides the 'asc' of test-cases/longevity/longevity-large-partition-2days.yaml.)
The test ran OK with a steady load during nemesis for a day; then loader 1 got disconnected and all load stopped. (SCT does not recognize this type of error and continued running nemeses; perhaps a different issue.) The loader was not sshable, and the EC2 console system log (attached at the bottom) doesn't show much either.
The configuration of the test had:
and then s-b ran with:
<<<<<<<
$ hydra investigate show-monitor 10079e1e-18ba-4ffc-a6b0-e4a399b7d23c
$ hydra investigate show-logs 10079e1e-18ba-4ffc-a6b0-e4a399b7d23c
Logs:
No logs captured during this run.
Log links for testrun with test id 10079e1e-18ba-4ffc-a6b0-e4a399b7d23c:

Date | Log type | Link
20221130_110838 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_110838/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221130_110948-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png
20221130_110838 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_110838/grafana-screenshot-overview-20221130_110839-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png
20221130_195014 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195014/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221130_195050-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png
20221130_195014 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195014/grafana-screenshot-overview-20221130_195014-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png
20221130_195821 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195821/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221130_195857-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png
20221130_195821 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195821/grafana-screenshot-overview-20221130_195821-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png
20221201_091457 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_091457/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221201_091534-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png
20221201_091457 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_091457/grafana-screenshot-overview-20221201_091457-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png
20221201_094657 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_094657/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221201_094733-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png
20221201_094657 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_094657/grafana-screenshot-overview-20221201_094657-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png
20221201_125438 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/db-cluster-10079e1e.tar.gz
20221201_125438 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/loader-set-10079e1e.tar.gz
20221201_125438 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/monitor-set-10079e1e.tar.gz
20221201_125438 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/sct-runner-10079e1e.tar.gz
Jenkins job URL
loader1-system-log-i-0ada74851321f4b0d.log