nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Limit policy maximum age didn't clean up, resulting in storage fill [v2.10.18] #5795

Open b2broker-yperfilov opened 3 months ago

b2broker-yperfilov commented 3 months ago

Observed behavior

We are using the limits retention policy with a maximum age of 15 minutes. However, one of the three nodes did not clean up storage in time, resulting in the storage filling up and a crash.

In the screenshot below, you can see the storage usage stats of the three nodes. Notice that the blue node has much larger storage usage compared to the red and yellow nodes. (Screenshot: CleanShot 2024-08-16 at 13 09 03)

The screenshot below is from the NATS dashboard; you can see that the stream message count also rose significantly. (Screenshot: CleanShot 2024-08-16 at 13 14 27)

The configuration of the stream is shown in the screenshot below. The stream was recreated during an attempt to fix the issue, but it has exactly the same settings. Notice the max age of 15 minutes, as well as the typical byte size and message count. (Screenshot: CleanShot 2024-08-16 at 13 10 59)
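
For reference, a stream with equivalent settings could be defined roughly like this with the NATS CLI (the stream name and subject are placeholders, not our actual values):

    nats stream add ORDERS \
      --subjects "orders.>" \
      --storage file \
      --retention limits \
      --max-age 15m \
      --replicas 3 \
      --max-bytes=-1 \
      --max-msgs=-1 \
      --discard old

With a limits retention policy and only max age set, messages older than 15 minutes should be expired on every replica, independently of consumer state.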

In the logs, there were errors (repeated several times):

2024-08-15 19:50:32.572 {"time":"2024-08-15T16:50:32.57225584Z","_p":"F","log":"[181] 2024/08/15 16:50:32.572168 [ERR] JetStream resource limits exceeded for server"}

Please let me know if you need any additional details

Expected behavior

The limits policy cleans up expired messages as expected.

Server and client version

Server 2.10.18

Host environment

K8s

      resources:
        limits:
          cpu: 400m
          memory: 768Mi
        requests:
          cpu: 400m
          memory: 768Mi

Steps to reproduce

not clear

derekcollison commented 2 months ago

When something like that happens, we ask the developer to capture some profiles for us, specifically CPU, memory (heap), and stacksz / goroutines.
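
A sketch of how such profiles can be captured, assuming the server has the profiling port enabled (e.g. prof_port: 65432 in the server config; the port number here is just an example):

    # CPU profile over 30 seconds
    curl -o cpu.prof "http://localhost:65432/debug/pprof/profile?seconds=30"

    # heap (memory) profile
    curl -o mem.prof "http://localhost:65432/debug/pprof/heap"

    # goroutine stack dump (stacksz)
    curl -o stacksz.txt "http://localhost:65432/debug/pprof/goroutine?debug=2"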

b2broker-yperfilov commented 2 months ago

@derekcollison here are screenshots of some metrics. I went through many memory metrics, and all of them look quite stable.

(Screenshots: CleanShot 2024-08-19 at 08 49 27, 08 49 19, 08 46 28, 08 46 20, 08 46 12, 08 46 00)

derekcollison commented 2 months ago

The stream info shows that the only limit you have in place, which is age, appears to be working correctly. What do you think is not working correctly?

Also do you properly set GOMEMLIMIT?
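
As a side note, when the container has a memory limit, GOMEMLIMIT is typically set slightly below it via the pod spec; a minimal sketch assuming the 768Mi limit shown above:

      env:
        - name: GOMEMLIMIT
          value: 700MiB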

b2broker-yperfilov commented 2 months ago

@derekcollison we do not have GOMEMLIMIT set. At the same time, the issue is not with the pod's memory; the issue is with disk storage.

We have replication across 3 nodes for this stream. That means each message should be copied to 3 nodes, and at any time the same amount of space should be occupied on each node (assuming all other streams also have a replication factor of 3). However, one of the nodes didn't follow this rule, as can be seen from the initial message, resulting in a disk leak.
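
When a single replica diverges like this, it can help to check the per-replica state reported by the CLI; a sketch (the stream name is a placeholder, and the server report requires system-account credentials):

    # cluster information for the stream, including whether each replica is current or lagging
    nats stream info ORDERS

    # per-server JetStream usage summary
    nats server report jetstream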

derekcollison commented 2 months ago

Can you share a du -sh from the store directory for the one that has increased disk usage?
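
For example, assuming the default store layout of store_dir/jetstream/&lt;account&gt;/streams/&lt;stream&gt; (the paths below are placeholders):

    # total JetStream store size on this node
    du -sh /data/jetstream

    # per-stream breakdown for one account
    du -sh /data/jetstream/ACCOUNT/streams/*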

b2broker-yperfilov commented 2 months ago

@derekcollison Now it is 1.3G. Another node is at 102.0M, and the third is at 97.4M.