rancher / rke2


rke2 server eats up all the memory #6370

Closed by harridu 1 week ago

harridu commented 1 month ago

Environmental Info: RKE2 Version:

root@kube005c00:~# /usr/local/bin/rke2 -v
rke2 version v1.28.10+rke2r1 (b0d0d687d98f4fa015e7b30aaf2807b50edcc5d7)
go version go1.21.9 X:boringcrypto

Node(s) CPU architecture, OS, and Version: Debian 12 running inside KVM, 4 cores, 32 GByte memory, no swap

root@kube005c00:~# uname -a
Linux kube005c00.ac.aixigo.de 6.1.0-22-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21) x86_64 GNU/Linux

Cluster Configuration: 3 controller nodes (KVM, 32 GByte RAM and 4 cores each) and 6 "real" worker nodes (512 GByte RAM and 64 cores each). All Debian 12, RKE2 1.28.10, managed by Rancher 2.8.5.

Describe the bug: On the control plane nodes rke2 uses up quite a big chunk of memory. On the first control plane node I see:

top - 07:34:13 up 2 days, 21:32,  1 user,  load average: 0.49, 0.62, 0.66
Tasks: 223 total,   1 running, 221 sleeping,   0 stopped,   1 zombie
%Cpu(s):  5.2 us,  2.3 sy,  0.0 ni, 91.2 id,  0.4 wa,  0.0 hi,  0.8 si,  0.1 st 
MiB Mem :  32094.8 total,   2515.0 free,  24040.7 used,   6008.8 buff/cache     
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   8054.1 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                         
    879 root      20   0   21.6g  20.4g  72544 S   0.7  65.0 226:22.84 /usr/local/bin/rke2 server                      
   2578 root      20   0 2603984   1.2g  81216 S   9.6   3.9      8,59 kube-apiserver --admission-control-config-file=+
   2314 root      20   0   11.0g 342968 191772 S   9.0   1.0      7,22 etcd --config-file=/var/lib/rancher/rke2/server+
    380 root      20   0  389584 298936 295608 S   0.0   0.9   1:09.86 /lib/systemd/systemd-journald                   
   1433 root      20   0 1356736 143084  41572 S   0.3   0.4  32:01.04 kube-scheduler --permit-port-sharing=true --aut+
   1082 root      20   0 1345244 118440  66816 S   3.0   0.4 179:23.01 kubelet --volume-plugin-dir=/var/lib/kubelet/vo+
   3677 root      20   0 2383748  94176  47724 S   1.7   0.3  94:27.17 calico-node -felix                              
   1045 root      20   0  791928  88868  49780 S   1.0   0.3  63:38.01 containerd -c /var/lib/rancher/rke2/agent/etc/c+
   4851 root      20   0 1347844  88592  62688 S   0.7   0.3  19:36.29 kube-controller-manager --flex-volume-plugin-di+
   1373 root      20   0 1286356  79868  40348 S   0.0   0.2  11:15.56 kube-proxy --cluster-cidr=10.42.0.0/16 --conntr+
   3681 root      20   0 1866344  72388  44660 S   0.0   0.2   1:06.10 calico-node -allocate-tunnel-addrs              
   3683 root      20   0 1866088  71560  42680 S   0.0   0.2   1:05.14 calico-node -status-reporter                    
   3676 root      20   0 1939820  68756  42200 S   0.0   0.2   0:35.75 calico-node -monitor-addresses                  
   3680 root      20   0 1866088  65992  41320 S   0.0   0.2   0:31.11 calico-node -monitor-token                      
   4948 root      20   0 1292736  59024  42068 S   0.3   0.2  13:33.89 cloud-controller-manager                        
    810 root      20   0 1275468  55068  32116 S   2.3   0.2  50:25.00 /usr/local/bin/rancher-system-agent sentinel    
   3523 nobody    20   0  746008  44024  21984 S   0.0   0.1   3:34.64 /bin/system-upgrade-controller                  

That is 20 GByte RSS. On the other control plane nodes it is "just" 3 GByte, which is still way too much for 3 days of uptime. Memory usage increases over time, until the first control plane node runs into OOM.

The worker nodes seem fine.
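A minimal sketch to track the growth over time (the log path is arbitrary):

# Log the combined RSS of all rke2 processes once a minute, so the growth
# can be correlated with events such as snapshot schedules.
while true; do
    printf '%s %s MiB\n' "$(date -Is)" \
        "$(ps -o rss= -C rke2 | awk '{s+=$1} END {printf "%.0f", s/1024}')"
    sleep 60
done >> /var/log/rke2-rss.log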

Steps To Reproduce: Set up a cluster using Rancher 2.8.5 and RKE2 1.28.10 and watch it grow. If I use RKE2 on the command line to set up a cluster, there is no such problem.

harridu commented 1 month ago

PS: K3s has the same problem.

serhiynovos commented 1 month ago

@harridu I had exactly the same issue https://github.com/rancher/rke2/issues/6249

Do you store etcd snapshots on S3 storage?
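You can check on a server node; Rancher-provisioned nodes usually keep their generated settings under config.yaml.d/ in addition to config.yaml:

# Look for S3 / snapshot settings in the RKE2 server config.
grep -rE 'etcd-s3|etcd-snapshot' \
    /etc/rancher/rke2/config.yaml /etc/rancher/rke2/config.yaml.d/ 2>/dev/null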

brandond commented 1 month ago

Please upgrade to v1.28.11+rke2r1

serhiynovos commented 1 month ago

@brandond BTW, I still don't see an option to upgrade RKE2 to 1.28.11 from Rancher. Do you have any info on when it will be available? Until then I have to go and manually clear my bucket every few weeks.

harridu commented 1 month ago

@serhiynovos, yes, I am using a local MinIO to store a copy of the snapshots.

Update: S3 snapshots were on, but they are disabled right now. Only local storage is configured.
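For reference, these are the server options that control the S3 uploads; endpoint, bucket, and credentials below are placeholders, not my actual values:

# Sketch of the S3 snapshot settings; all values are placeholders.
cat >> /etc/rancher/rke2/config.yaml <<'EOF'
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 5
etcd-s3: true
etcd-s3-endpoint: minio.example.internal:9000
etcd-s3-bucket: rke2-snapshots
etcd-s3-access-key: <access-key>
etcd-s3-secret-key: <secret-key>
EOF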

serhiynovos commented 1 month ago

@harridu Please check your bucket. There should be a lot of snapshots. You can clean them manually and see if it will resolve the issue.
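With the MinIO client that is something along these lines; the myminio alias and the bucket name are placeholders for your setup:

# Count the accumulated snapshot objects, then remove them.
mc ls --recursive myminio/rke2-snapshots/ | wc -l
mc rm --recursive --force myminio/rke2-snapshots/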

harridu commented 1 month ago

Deleted. How come the S3 storage is still in use, even though it is disabled in the GUI?

harridu commented 1 month ago

After removing all backups from S3, rke2 memory usage stays low, as far as I can tell.
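To keep an eye on it, one can count the snapshot records the server still tracks; on recent RKE2 each snapshot, including the S3 copies, should be mirrored as an ETCDSnapshotFile object (assuming the k3s.cattle.io CRD is present):

# Snapshots as the server itself reports them, and the per-snapshot
# records kept at cluster level.
rke2 etcd-snapshot ls
kubectl get etcdsnapshotfiles.k3s.cattle.io --no-headers | wc -l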

I still have no idea why it backs up to S3 at all. If 1.28.11 provides a fix, then please make it available in Rancher 2.8.5.

brandond commented 1 month ago

> I still have no idea why it backs up to S3 at all.

I'm not sure what you mean. Why does it back up to S3 when you configure S3 for backups?

serhiynovos commented 1 month ago

> I still have no idea why it backs up to S3 at all.

> I'm not sure what you mean. Why does it back up to S3 when you configure S3 for backups?

@brandond I think @harridu means that he disabled S3 backups in the Rancher UI, but RKE2 still uploads them to S3 storage.

BTW, I finally got the 1.28.11 version in Rancher. The issue with S3 is resolved.

mikejoh commented 3 weeks ago

> Please upgrade to v1.28.11+rke2r1

Those of us on v2.8.4 or v2.8.5 of Rancher without Prime don't have the option to pick 1.28.11+rke2r1; it's not part of the supported RKE2 releases, at least. Technically we could of course deploy 1.28.11, but we have had problems before when deploying a later version than the one specified for a given Rancher release.

Any suggestions or input on this?

brandond commented 3 weeks ago

Upgrade to 2.9.0, or deal with running newer RKE2 releases that are not technically supported by the patch release of Rancher that you're on.
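If you go the latter route, one hypothetical way to pin a version the UI dropdown does not offer is to edit the provisioning Cluster object directly (namespace and object type per standard Rancher v2 provisioning; the cluster name is a placeholder):

# Hypothetical: pin the version in the provisioning Cluster object.
kubectl -n fleet-default edit clusters.provisioning.cattle.io <cluster-name>
# then set: spec.kubernetesVersion: v1.28.11+rke2r1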

mikejoh commented 2 weeks ago

> Upgrade to 2.9.0, or deal with running newer RKE2 releases that are not technically supported by the patch release of Rancher that you're on.

Thanks!

As a side note: we just noticed that when we check RKE2 versions in the Rancher UI we can select version 1.28.11, but it is not mentioned in the release notes for the version we're on, 2.8.4: https://github.com/rancher/rancher/releases/tag/v2.8.4. Is the list of versions we can upgrade to perhaps updated dynamically?

brandond commented 2 weeks ago

It is, yes.

brandond commented 1 week ago

Closing as resolved in releases that have a fix for the S3 snapshot prune issue.