scylladb / scylla-bench


s-b stress with select-order-by=desc causes a memory leak #112

Closed yarongilor closed 1 year ago

yarongilor commented 1 year ago

Installation details

Kernel Version: 5.15.0-1023-aws
Scylla version (or git commit hash): 2022.2.0~rc5-20221121.feb292600fc4 with build-id 172f9538efb0893c97c86cdf05622925159f4fa2

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0b77b476432e37d90 (aws: eu-west-1)

Test: longevity-large-partition-4days-test-rq2
Test id: 10079e1e-18ba-4ffc-a6b0-e4a399b7d23c
Test name: scylla-staging/Longevity_yaron/longevity-large-partition-4days-test-rq2
Test config file(s):

Issue description

Scenario: run a new longevity test in which scylla-bench uses the select-order-by=desc parameter in order to test reversed queries. However, for some reason (perhaps an update after a rebase), the table was actually created with 'desc', so the branch and configuration have to be rechecked (the file configurations/scylla-bench-stress-order-by-desc.yaml had 'desc', which overrides the 'asc' of test-cases/longevity/longevity-large-partition-2days.yaml). The test ran fine with a steady load during nemesis for a day, then loader 1 got disconnected and all load stopped. (SCT does not recognise this type of error and continued running nemeses; that is perhaps a separate issue.) The loader was not SSH-able, and the EC2 console system log (attached at the bottom) doesn't show much either.

The test configuration had:

Generated a create-table query with a seed of [933]: 
                    CREATE TABLE IF NOT EXISTS scylla_bench.test (
                    pk bigint,
                    ck bigint,
                    v blob,
                    PRIMARY KEY (pk, ck)
                ) WITH CLUSTERING ORDER BY (ck DESC)

and then s-b ran with:

< t:2022-11-30 10:28:35,638 f:scylla_bench_thread.py l:183  c:sdcm.scylla_bench_thread p:DEBUG > Scylla bench command: scylla-bench -workload=uniform -mode=read -select-order-by=desc -replication-factor=3 -partition-count=800 -clustering-row-count=50000 -clustering-row-size=uniform:100..5120 -rows-per-request=2000 -timeout=240s -retry-number=10 -concurrency=400 -duration=44h -connection-count 400
< t:2022-11-30 10:28:35,640 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=scylla-bench -workload=uniform -mode=read -select-order-by=desc -replication-factor=3 -partition-count=800 -clustering-row-count=50000 -clustering-row-size=uniform:100..5120 -rows-per-request=2000 -timeout=240s -retry-number=10 -concurrency=400 -duration=44h -connection-count 400
< t:2022-11-30 10:28:35,640 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo  docker exec c62a72e0f3eb3523bf8ee4e599cf5d9885bb90d0ee41f79e4cddac690cad03d8 /bin/sh -c 'scylla-bench -workload=uniform -mode=read -select-order-by=desc -replication-factor=3 -partition-count=800 -clustering-row-count=50000 -clustering-row-size=uniform:100..5120 -rows-per-request=2000 -timeout=240s -retry-number=10 -concurrency=400 -duration=44h -connection-count 400 -nodes 10.4.0.222,10.4.3.75,10.4.0.73,10.4.0.168'"...


Logs:

No logs captured during this run.

Log links for test run with test id 10079e1e-18ba-4ffc-a6b0-e4a399b7d23c:

| Date | Log type | Link |
|---|---|---|
| 20221130_110838 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_110838/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221130_110948-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221130_110838 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_110838/grafana-screenshot-overview-20221130_110839-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221130_195014 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195014/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221130_195050-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221130_195014 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195014/grafana-screenshot-overview-20221130_195014-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221130_195821 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195821/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221130_195857-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221130_195821 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221130_195821/grafana-screenshot-overview-20221130_195821-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221201_091457 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_091457/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221201_091534-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221201_091457 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_091457/grafana-screenshot-overview-20221201_091457-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221201_094657 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_094657/grafana-screenshot-longevity-large-partition-4days-test-rq2-scylla-per-server-metrics-nemesis-20221201_094733-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221201_094657 | grafana | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_094657/grafana-screenshot-overview-20221201_094657-longevity-large-partitions-order-by-monitor-node-10079e1e-1.png |
| 20221201_125438 | db-cluster | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/db-cluster-10079e1e.tar.gz |
| 20221201_125438 | loader-set | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/loader-set-10079e1e.tar.gz |
| 20221201_125438 | monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/monitor-set-10079e1e.tar.gz |
| 20221201_125438 | sct | https://cloudius-jenkins-test.s3.amazonaws.com/10079e1e-18ba-4ffc-a6b0-e4a399b7d23c/20221201_125438/sct-runner-10079e1e.tar.gz |

Jenkins job URL

loader1-system-log-i-0ada74851321f4b0d.log

fruch commented 1 year ago

@yarongilor

Seriously? Please supply complete information, and links to jobs and logs.

No one would look into an issue like this one....

yarongilor commented 1 year ago

> @yarongilor
>
> Seriously? Please supply complete information, and links to jobs and logs.
>
> No one would look into an issue like this one....

@fruch, you're responding too fast.. I'm still in the middle of updating it, waiting for SCT to complete the log collection, etc. Thanks for noticing anyway.

fruch commented 1 year ago

> @fruch, you're responding too fast.. I'm still in the middle of updating it, waiting for SCT to complete the log collection, etc. Thanks for noticing anyway.

What do you mean by disconnected?

Can you add the relevant log lines showing this disconnection?

yarongilor commented 1 year ago

> mode=read -select-order-by=desc

@fruch, disconnected means not SSH-able, i.e. no network; perhaps the network daemon is down, perhaps various side effects of a memory leak. When that disconnection happened, the scylla-bench running on loader-1 couldn't reach any cluster node, and kept failing with permanent errors of:

23h3m16.3s    1838 3611476      0 2.1s   788ms  590ms  471ms  395ms  230ms  251ms  
2022/12/01 10:53:44 EOF
2022/12/01 10:59:06 EOF
2022/12/01 11:01:31 EOF
2022/12/01 11:01:31 EOF
2022/12/01 11:01:32 EOF
2022/12/01 11:01:32 EOF
2022/12/01 11:01:32 EOF
2022/12/01 11:01:32 writev tcp 10.4.2.12:47131->10.4.0.122:9042: writev: broken pipe
2022/12/01 11:01:32 writev tcp 10.4.2.12:58521->10.4.0.122:9042: writev: connection timed out
2022/12/01 11:01:33 writev tcp 10.4.2.12:47131->10.4.0.122:9042: writev: broken pipe
2022/12/01 11:01:33 EOF
2022/12/01 11:01:33 EOF
2022/12/01 11:01:34 writev tcp 10.4.2.12:47131->10.4.0.122:9042: writev: broken pipe
2022/12/01 11:01:34 EOF
2022/12/01 11:01:34 EOF
2022/12/01 11:01:34 writev tcp 10.4.2.12:47131->10.4.0.122:9042: writev: broken pipe
2022/12/01 11:01:34 writev tcp 10.4.2.12:60321->10.4.0.222:9042: writev: connection timed out
2022/12/01 11:01:35 writev tcp 10.4.2.12:42561->10.4.3.75:9042: writev: connection timed out
2022/12/01 11:01:35 EOF

The last report from loader-1 found in the SCT log is:

< t:2022-12-01 08:37:08,354 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-12-01 08:37:08.353: (CassandraStressEvent Severity.NORMAL) period_type=end event_id=1b22c7e7-4308-44da-945c-85b18e4b8123 duration=29s: node=Node longevity-large-partitions-order-by-loader-node-10079e1e-1 [34.245.153.178 | 10.4.2.12] (seed: False)
fruch commented 1 year ago

> @fruch, disconnected means not SSH-able, i.e. no network; perhaps the network daemon is down, perhaps various side effects of a memory leak. When that disconnection happened, the scylla-bench running on loader-1 couldn't reach any cluster node, and kept failing with permanent EOF / broken pipe / connection timed out errors.

Did you look at the monitor for the loader's memory usage to see that memory leak, if any?

What do the loader's messages logs say? If the network daemon went down or crashed, it would show up there.

Is this reproducible?

yarongilor commented 1 year ago

@fruch, the issue wasn't reproduced using a bigger loader instance with double the memory size (instance_type_loader: 'c5n.4xlarge').

fruch commented 1 year ago

> @fruch, the issue wasn't reproduced using a bigger loader instance with double the memory size (instance_type_loader: 'c5n.4xlarge').

Your initial report here doesn't look like a memory leak:

[image]

yarongilor commented 1 year ago

> Your initial report here doesn't look like a memory leak:
>
> [image]

@fruch, I don't know what this graph is. Perhaps it's the monitor node and not the loader.

In any case, we can also try to see whether we get a faster reproduction when using a smaller instance type for the loader.

vponomaryov commented 1 year ago

> @fruch, I don't know what this graph is. Perhaps it's the monitor node and not the loader.
>
> In any case, we can also try to see whether we get a faster reproduction when using a smaller instance type for the loader.

@yarongilor is right, the provided graph is for the monitoring node. Proof: [screenshot from 2022-12-06 17-29-11] The graph is for the node with the 10.4.2.28 IP address, which belongs to the monitoring node:

11:45:11  < t:2022-11-30 09:45:11,445 f:cluster.py      l:495  c:sdcm.cluster_aws     p:INFO  > Node longevity-large-partitions-order-by-monitor-node-10079e1e-1 [3.250.82.140 | 10.4.2.28] (seed: False): Detected Linux distribution: CENTOS7

Also, I can add that sometime over the summer I observed the problem described here. When the problem appears, even a restart of the node doesn't help and network access is still lost. In my case it appeared only with a wrong s-b configuration, such as reading non-existing rows/partitions, but I assume that only sped up the appearance of the problem.

So, s-b does something to the machine such that networking/SSH can never come up again.

fruch commented 1 year ago

> @yarongilor is right, the provided graph is for the monitoring node. [...] Also, I can add that sometime over the summer I observed the problem described here. When the problem appears, even a restart of the node doesn't help and network access is still lost. [...] So, s-b does something to the machine such that networking/SSH can never come up again.

So, s-b does something to the machine that the networking/SSH never can come up again.

As far as I know, there's probably a kernel panic, which renders the machine useless (i.e. no network, no SSH, and no traffic) until it's rebooted; the question is what causes the kernel panic.

If we are leaking memory, we need some proof of that.
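
One lightweight way to gather that proof would be to log the Go runtime's memory statistics from inside the loader process itself. A minimal sketch, assuming such instrumentation were compiled into the benchmark binary (it is not part of scylla-bench today; the names here are illustrative):

```go
// memstats_probe.go - hypothetical instrumentation. runtime.ReadMemStats only
// covers the current process, so logMemStats would have to be started from
// inside the benchmark binary, e.g. with `go logMemStats(30 * time.Second)`.
package main

import (
	"log"
	"runtime"
	"time"
)

// logMemStats periodically prints heap and goroutine figures. A heap that keeps
// growing across GC cycles, or a goroutine count that only ever rises, is the
// kind of evidence a leak report needs.
func logMemStats(every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for range t.C {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heap_alloc=%d MiB heap_objects=%d goroutines=%d",
			m.HeapAlloc/(1024*1024), m.HeapObjects, runtime.NumGoroutine())
	}
}

func main() {
	go logMemStats(30 * time.Second)
	select {} // stand-in for the real benchmark workload
}
```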

fruch commented 1 year ago

We are collecting it with collectd (and not with node_exporter). This is the query that would show memory consumption:

sum(collectd_memory{memory!="used"}) by (instance) / sum(collectd_memory{memory=~"..*"}) by (instance) * 100

[image]

And yes, this looks like a memory leak.

avikivity commented 1 year ago

@michoecho in case it's related to your woes with s-b

roydahan commented 1 year ago

@vponomaryov, are you trying to look at it? (I don't know if you have the capacity.) We should probably run a profiler (pprof?), so if you don't have much time, you can ask @yarongilor to run the reproducer and the profiler for a few minutes (see the sketch below).

A good hint is that it probably happens only with the reversed-query implementation (or at least it's more severe there).
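
A minimal sketch of what such a profiling run could look like, assuming the loader binary exposed Go's standard net/http/pprof endpoint (scylla-bench does not necessarily expose one; this is illustrative only):

```go
// pprof_endpoint.go - hypothetical: expose Go's built-in profiling handlers so
// heap profiles can be pulled from the running loader process.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		// Serve profiling data on a side port while the workload runs.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the benchmark workload
}
```

A heap profile could then be captured with `go tool pprof http://localhost:6060/debug/pprof/heap`, and two snapshots taken a few minutes apart compared to see which allocation sites keep growing.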

vponomaryov commented 1 year ago

> We should probably run a profiler (pprof?), so if you don't have much time, you can ask @yarongilor to run the reproducer and the profiler for a few minutes.
>
> A good hint is that it probably happens only with the reversed-query implementation (or at least it's more severe there).

I played with pprof a bit yesterday and it didn't show anything useful from that brief attempt. So a deeper dive is needed.

roydahan commented 1 year ago

@piodul would you be able to take a look at this issue?

piodul commented 1 year ago

I found a memory leak in scylla-bench and posted a fix for it: https://github.com/scylladb/scylla-bench/pull/117. It doesn't seem related to reverse queries, but the behavior looks similar: memory usage fluctuates, but it tends to grow. If this issue is reproducible in SCT, then it might be worth running the same test after the fix is merged.
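
To illustrate the general class of bug (a generic sketch, not the actual code changed in scylla-bench#117): a worker goroutine that sends its result on an unbuffered channel is leaked, together with everything it references, whenever the caller times out and stops listening; giving the channel a buffer lets the worker complete its send and exit.

```go
// channel_leak.go - generic illustration of a goroutine/channel leak; the
// functions and numbers are made up for the example, not taken from scylla-bench.
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leakyQuery spawns a worker that sends its result on an unbuffered channel.
// If the caller times out and returns, the worker blocks on the send forever.
func leakyQuery() (string, error) {
	ch := make(chan string) // unbuffered: the send blocks until someone receives
	go func() {
		time.Sleep(50 * time.Millisecond) // simulate a slow query
		ch <- "result"                    // blocks forever once the caller has given up
	}()
	select {
	case r := <-ch:
		return r, nil
	case <-time.After(10 * time.Millisecond):
		return "", fmt.Errorf("timed out") // the worker goroutine is now leaked
	}
}

// fixedQuery uses a buffered channel, so the worker can always complete its
// send and exit, even when the caller has stopped waiting.
func fixedQuery() (string, error) {
	ch := make(chan string, 1) // buffered: the send never blocks
	go func() {
		time.Sleep(50 * time.Millisecond)
		ch <- "result"
	}()
	select {
	case r := <-ch:
		return r, nil
	case <-time.After(10 * time.Millisecond):
		return "", fmt.Errorf("timed out")
	}
}

func main() {
	base := runtime.NumGoroutine()

	for i := 0; i < 100; i++ {
		leakyQuery() // each timed-out call strands one worker on the send
	}
	time.Sleep(200 * time.Millisecond)
	afterLeaky := runtime.NumGoroutine()

	for i := 0; i < 100; i++ {
		fixedQuery() // workers finish and exit; nothing accumulates
	}
	time.Sleep(200 * time.Millisecond)
	afterFixed := runtime.NumGoroutine()

	// Expected: afterLeaky is ~100 above base, while afterFixed stays at the
	// same level as afterLeaky (the old leaks remain, but no new ones appear).
	fmt.Printf("baseline=%d after leaky calls=%d after fixed calls=%d\n",
		base, afterLeaky, afterFixed)
}
```

Every such stranded goroutine pins its channel and whatever buffers it captured, which would produce exactly the slow "fluctuates but tends to grow" memory curve described above.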

yarongilor commented 1 year ago

Looks like the memory leak is not reproduced after ~2 days. Tested with the fix, using the scylla-bench image scylladb/hydra-loaders:scylla-bench-112-fix-channel-leak.

[image]

https://jenkins.scylladb.com/job/scylla-staging/job/Longevity_yaron/job/longevity-large-partition-4days-test-rq2/22/

[image]