k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

K3s etcd snapshot reconcile consumes excessive memory when a large number of snapshots are present #10450

Open · purha opened this issue 1 week ago

purha commented 1 week ago

Environmental Info: K3s Version: v1.28.10+k3s1 (a4c5612e) go version go1.21.9

The issue also affects some earlier versions.

Node(s) CPU architecture, OS, and Version: Linux kubernetes-worker-f-1 6.1.0-22-arm64 #1 SMP Debian 6.1.94-1 (2024-06-21) aarch64 GNU/Linux

The issue also exists at least on Ubuntu 22.04 arm64.

Cluster Configuration: 3 servers, each running all roles

Describe the bug:

When there are a lot of snapshots in S3, k3s will at some point consume all available memory, and then the OOM killer kicks in. Normally snapshots get cleaned up, but because of bug https://github.com/k3s-io/k3s/issues/10292 they don't. This only affects a single node in the cluster.
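
As a rough way to size the problem, the number of snapshots k3s is tracking can be checked with the built-in subcommand. A sketch only; the exact output format, and whether S3 objects are included, may vary by version and flags:

```sh
# Rough count of snapshots k3s knows about (any header line is included in the count).
k3s etcd-snapshot ls | wc -l
```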

Steps To Reproduce:

Expected behavior:

Memory should not run out.

Actual behavior:

Memory runs out on the node, causing the OOM killer to start killing processes. Restarting the k3s service fixes the issue for some time, until memory runs out again.

Additional context / logs:

Memory consumption example graph attached.

[Screenshot 2024-07-04: memory consumption graph]
brandond commented 1 week ago

> In case when there's lots of snapshots in s3, at some point k3s will consume all memory available, and then oom killer will kick in.

Is there actually a cumulative memory leak, or is the memory required to manage the snapshots directly proportional to the number of snapshots found on disk and in S3?

If there is a cumulative memory leak, this should show up as increasing memory usage over time despite a static number of etcd snapshots.

purha commented 1 week ago

> > In case when there's lots of snapshots in s3, at some point k3s will consume all memory available, and then oom killer will kick in.
>
> Is there actually a cumulative memory leak, or is the memory required to manage the snapshots directly proportional to the number of snapshots found on disk and in S3?
>
> If there is a cumulative memory leak, this should show up as increasing memory usage over time despite a static number of etcd snapshots.

It seems cumulative; the number of snapshots affects how long it takes for memory to run out. With S3 snapshots disabled, the issue is gone and memory usage is stable.

brandond commented 1 week ago

What are the units on your graph? Can you show the actual memory utilization of the k3s process in bytes? How many snapshots did you have in the cluster when you saw the memory utilization growing?

I'm trying to reproduce this by profiling k3s with S3 enabled, retention set to 120, and snapshots taken once per minute, but I'm not quite sure that I'm seeing exactly the same thing as you.
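
For reference, that reproduction setup roughly corresponds to a server configuration like the following sketch. The endpoint, bucket, and credential values are placeholders, and the flag names should be confirmed against the k3s docs for your version:

```yaml
# /etc/rancher/k3s/config.yaml (sketch)
etcd-snapshot-schedule-cron: "* * * * *"   # one snapshot per minute
etcd-snapshot-retention: 120
etcd-s3: true
etcd-s3-endpoint: "s3.example.com:9000"    # placeholder
etcd-s3-bucket: "k3s-snapshots"            # placeholder
etcd-s3-access-key: "ACCESS_KEY"           # placeholder
etcd-s3-secret-key: "SECRET_KEY"           # placeholder
```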

brandond commented 1 week ago

I am also curious if you've tried adding a memory limit to the k3s systemd unit. By default the k3s systemd unit does not have a memory limit on it, and without any external memory pressure, golang will not free memory back to the operating system. So you could just be seeing secondary effects of k3s requiring more memory to reconcile a large number of snapshots, and golang not freeing memory until it absolutely needs to.
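
For anyone who wants to try that, a systemd drop-in is one way to apply such a limit. A minimal sketch, assuming the default `k3s.service` unit name and cgroup v2; the limit values are placeholders to size for your node:

```ini
# /etc/systemd/system/k3s.service.d/memory.conf (hypothetical drop-in)
[Service]
# Soft limit: above this the kernel applies memory pressure, which encourages
# the Go runtime to return unused pages to the OS.
MemoryHigh=2G
# Hard limit: processes in the unit's cgroup are OOM-killed if usage exceeds this.
MemoryMax=3G
```

After creating the drop-in, `systemctl daemon-reload && systemctl restart k3s` applies it.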

brandond commented 1 week ago

Just to share what I'm seeing: I do see k3s allocating a lot of memory while reconciling snapshots, but this memory is freed at the end of each snapshot save cycle. Note that the memory is allocated but no longer in use, which means that it is available to be freed or reused. This is NOT a leak, but I can try to see if there is some potential for enhancement here to avoid the momentary spike in memory during reconcile.

alloc_space: [pprof heap profile image]

inuse_space: [pprof heap profile image]
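
For anyone wanting to capture the same kind of data, the profiles above can be pulled from a running server roughly like this. A sketch only; it assumes the pprof endpoint has been exposed (for example via the server's `--enable-pprof` flag) and that the supervisor port is reachable locally:

```sh
# Grab a heap profile from the supervisor's pprof endpoint (URL/port assumed).
curl -sk https://127.0.0.1:6443/debug/pprof/heap -o heap.pprof

# Cumulative allocations since process start (includes memory already freed).
go tool pprof -sample_index=alloc_space -top heap.pprof

# Memory still live at the moment of capture.
go tool pprof -sample_index=inuse_space -top heap.pprof
```
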
brandond commented 1 week ago

Just from glancing at this, I suspect that adding some pagination to the various list operations would take the memory utilization down a lot. The current code pulls a full list into memory on every pass, which will be expensive with hundreds of snapshots.

The profiling also makes it clear that this is NOT a leak, and is not related to minio. So I am going to edit the issue title to better reflect the root of the problem.

[pprof profile image]
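
As a general illustration of the pagination idea mentioned above (this is not the actual k3s snapshot code; the kubeconfig path, resource type, namespace, and page size are all assumptions), a client-go list can be paged with `Limit` and `Continue` so that only one page of objects is resident at a time:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative kubeconfig path only.
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/rancher/k3s/k3s.yaml")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ctx := context.Background()
	cont := ""
	total := 0
	for {
		// Ask the API server for at most 100 objects; it returns a continue
		// token while more results remain.
		page, err := client.CoreV1().ConfigMaps("kube-system").List(ctx, metav1.ListOptions{
			Limit:    100,
			Continue: cont,
		})
		if err != nil {
			panic(err)
		}
		total += len(page.Items)
		// Each page can be processed and dropped before the next fetch, so
		// peak memory is bounded by the page size rather than the full list.
		cont = page.Continue
		if cont == "" {
			break
		}
	}
	fmt.Println("objects seen:", total)
}
```

The same general idea applies to the S3 side: object listings can be consumed incrementally rather than accumulated into a single slice.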

purha commented 1 week ago

I don't have the data anymore, but I think there were around 300 snapshots or more, covering 60 days or so, plus a few on-demand snapshots. I deleted all but the last 14 days, and you can see from the graph that it helped slightly. You can also see when I disabled the snapshots altogether. I've also attached the heap profile that I took, although at that point k3s was already consuming gigabytes of memory. I didn't try to set memory limits for the service.

[Heap profile image (profile008)]

[Screenshot 2024-07-09: memory usage graph]
purha commented 1 week ago

The usage is in percent, and that's a node with 8 GB of memory.