nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

JetStream memory leak? #4648

Open andreme opened 11 months ago

andreme commented 11 months ago

What version were you using?

2.10.2

What environment was the server running in?

Kubernetes, default configuration from the NATS helm chart, updated to use 2.10.2-alpine with cluster (3 replicas) and jetstream enabled. GOMEMLIMIT is 900MiB, and pods request 1Gi of memory and are limited to 3Gi.
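
For concreteness, a rough sketch of those settings applied with plain kubectl — in practice they come from the helm chart's values, and the statefulset/container name "nats" here is an assumption:

    kubectl set resources statefulset/nats -c nats --requests=memory=1Gi --limits=memory=3Gi
    kubectl set env statefulset/nats -c nats GOMEMLIMIT=900MiB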

Is this defect reproducible?

Yes, happens again after restarting the NATS servers

Given the capability you are leveraging, describe your expectation?

Consistent low memory usage

Given the expectation, what is the defect you are observing?

Since the upgrade from 2.9.22 to 2.10.2 we have to regularly restart the NATS servers to avoid them running out of memory.

A single stream is used, with some subjects shared between all consumers and some dedicated to a single consumer. Messages are normally between 20 bytes and 3 KB. Messages are consumed and immediately acked. Consumers are never slow, and there is rarely more than 100 KB of data in the queue. Nearly all messages go through JetStream (99.9999%). The number of consumers does not change. JetStream disk usage is < 10 MB.

nats-2 is also affected by the increased memory usage, even though it never had any consumers.

The memory consistently builds up over time.

The last restart was 13.5 hours ago; the info below was captured just now.

Stream info:

Information for Stream X created 2023-10-10 01:10:57

              Subjects: A, B, C, D, E.*, F.*
              Replicas: 3
               Storage: File

Options:

             Retention: Interest
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 2m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: unlimited
         Maximum Bytes: unlimited
           Maximum Age: 11m0s
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats
                Leader: nats-0
               Replica: nats-1, current, seen 14ms ago
               Replica: nats-2, current, seen 14ms ago

State:

              Messages: 0
                 Bytes: 0 B
        First Sequence: 421,711,295
         Last Sequence: 421,711,294 @ 2023-10-11 09:51:02 UTC
      Active Consumers: 5

Consumer example (settings are the same for all, but the filter subjects vary):

Information for Consumer X > Z created 2023-10-10T01:11:01Z

Configuration:                                                           

            Durable Name: Z                                         
               Pull Mode: true                                           
         Filter Subjects: A, E.Z, F.Y, D
          Deliver Policy: New
              Ack Policy: Explicit
                Ack Wait: 30.00s
           Replay Policy: Instant
         Max Ack Pending: 1,000
       Max Waiting Pulls: 512
      Inactive Threshold: 10m0s

Cluster Information:

                    Name: nats
                  Leader: nats-0
                 Replica: nats-1, current, seen 31ms ago
                 Replica: nats-2, current, seen 33ms ago

State:

  Last Delivered Message: Consumer sequence: 11,319,431 Stream sequence: 423,811,923 Last delivery: 36ms ago
    Acknowledgment Floor: Consumer sequence: 11,319,431 Stream sequence: 423,811,923 Last Ack: 33ms ago
        Outstanding Acks: 0 out of maximum 1,000
    Redelivered Messages: 0
    Unprocessed Messages: 0
           Waiting Pulls: 7 of maximum 512

Connections:

╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                             Top 6 Connections out of 6 by subs                                                             │
├─────┬──────────────────────────────┬────────┬─────────┬────────────────────┬─────────┬───────────┬─────────────┬─────────────┬──────────┬───────────┬──────┤
│ CID │ Name                         │ Server │ Cluster │ IP                 │ Account │ Uptime    │ In Msgs     │ Out Msgs    │ In Bytes │ Out Bytes │ Subs │
├─────┼──────────────────────────────┼────────┼─────────┼────────────────────┼─────────┼───────────┼─────────────┼─────────────┼──────────┼───────────┼──────┤
│ 69  │ NATS CLI Version development │ nats-0 │ nats    │ 10.244.1.102:42596 │         │ 0s        │ 1           │ 0           │ 254 B    │ 0 B       │ 1    │
│ 49  │                              │ nats-0 │ nats    │ 10.244.0.239:54620 │         │ 13h26m54s │ 85,960,707  │ 4,583,239   │ 33 GiB   │ 367 MiB   │ 3    │
│ 42  │                              │ nats-1 │ nats    │ 10.244.0.143:33546 │         │ 13h24m28s │ 86,083,043  │ 4,592,787   │ 33 GiB   │ 367 MiB   │ 3    │
│ 59  │                              │ nats-0 │ nats    │ 10.244.1.48:47886  │         │ 13h24m28s │ 567,858     │ 550,868     │ 17 MiB   │ 154 MiB   │ 4    │
│ 41  │                              │ nats-1 │ nats    │ 10.244.0.247:47996 │         │ 13h24m28s │ 566,761     │ 549,739     │ 17 MiB   │ 153 MiB   │ 4    │
│ 60  │                              │ nats-0 │ nats    │ 10.244.0.178:36418 │         │ 13h24m28s │ 178,160,113 │ 162,865,183 │ 1.8 GiB  │ 66 GiB    │ 5    │
├─────┼──────────────────────────────┼────────┼─────────┼────────────────────┼─────────┼───────────┼─────────────┼─────────────┼──────────┼───────────┼──────┤
│     │ TOTALS FOR 6 CONNECTIONS     │        │         │                    │         │           │ 351,338,483 │ 173,141,816 │ 68 GIB   │ 67 GIB    │ 20   │
╰─────┴──────────────────────────────┴────────┴─────────┴────────────────────┴─────────┴───────────┴─────────────┴─────────────┴──────────┴───────────┴──────╯

╭────────────────────────────────╮
│     Connections per server     │
├────────┬─────────┬─────────────┤
│ Server │ Cluster │ Connections │
├────────┼─────────┼─────────────┤
│ nats-1 │ nats    │           2 │
│ nats-0 │ nats    │           4 │
╰────────┴─────────┴─────────────╯

Memory: (screenshot of memory usage graph attached)

derekcollison commented 11 months ago

Could you create a snapshot of the stream and the consumers for us to possibly take a look at?

Also, could you enable profiling and when you see the memory growing take a memory profile of the heap? https://docs.nats.io/running-a-nats-service/nats_admin/profiling
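
For reference, a minimal sketch of what that typically looks like, following the linked docs — the port is the docs' example value and the paths are the standard Go pprof endpoints, so adjust to your deployment:

    # nats-server.conf: expose the profiling endpoint
    prof_port: 65432

    # once memory has grown, capture in-use heap and allocation profiles
    curl -o mem.prof http://127.0.0.1:65432/debug/pprof/heap
    curl -o allocs.prof http://127.0.0.1:65432/debug/pprof/allocs

    # quick local look (the .prof files can also be attached to the issue)
    go tool pprof -top mem.prof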

andreme commented 11 months ago

> Could you create a snapshot of the stream and the consumers for us to possibly take a look at?

Sent you an email with the backup.

> Also, could you enable profiling and when you see the memory growing take a memory profile of the heap? https://docs.nats.io/running-a-nats-service/nats_admin/profiling

I'm unable to restart the NATS servers without message loss at the moment (client misconfiguration), and we're going back to 2.9.

I will try to reproduce the problem in our staging system, and take a profile when it happens.

andreme commented 11 months ago

Downgraded to 2.9.22; each NATS server doesn't use more than 200 MB. Forgot to mention this: we use the new 2.10 feature to filter multiple subjects in the consumer.
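
For context, this refers to the 2.10 multi-filter consumer option; roughly, the consumer shown earlier would carry a config along these lines (field names from the JetStream consumer config JSON, values copied from the consumer info above):

    {
      "durable_name": "Z",
      "filter_subjects": ["A", "E.Z", "F.Y", "D"],
      "deliver_policy": "new",
      "ack_policy": "explicit",
      "ack_wait": 30000000000,
      "max_ack_pending": 1000,
      "max_waiting": 512
    }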

derekcollison commented 11 months ago

If you run the 2.10 server but with apps that do not use multiple filter subjects per consumer, does the behavior in terms of memory and CPU match the 2.9 server?

derekcollison commented 11 months ago

The best thing would be to get a memory profile along with a CPU profile and a stacksz dump. Is that possible?
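
With the same prof_port enabled as above, the CPU profile and goroutine stacks can be pulled roughly like this — the ports/paths are assumptions for your setup (stacksz itself lives on the monitoring port):

    # 30-second CPU profile via the pprof port
    curl -o cpu.prof "http://127.0.0.1:65432/debug/pprof/profile?seconds=30"

    # goroutine stack dump (roughly what a stacksz dump contains)
    curl -o stacks.txt "http://127.0.0.1:65432/debug/pprof/goroutine?debug=2"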

andreme commented 10 months ago

I haven't yet been able to reproduce the issue in staging (with/without subjects filter).

derekcollison commented 10 months ago

We released 2.10.3 as well, just FYI.

andreme commented 9 months ago

I'm closing this because we've completely removed the use of jetstream.

derekcollison commented 9 months ago

Would you be open to jumping on a short call?

alexosumi commented 5 months ago

Could we reopen this issue? I have exactly the same problem on version 2.10.12. What seems to happen is that messages that enter a stream stay in memory even after receiving the ack.

alexosumi commented 5 months ago

(screenshot of memory usage attached)

neilalexander commented 5 months ago

Please can you grab a memory profile? Without that, it is difficult to diagnose. Thanks!

alexosumi commented 5 months ago

Ok, I'm reading about profiling, I'll try to get it tomorrow. Thx

alexosumi commented 5 months ago

Hi, is there an email address I can send these profiling files to? I don't feel very comfortable posting them here, but if you tell me it's okay, I'll post them here. I took 3 samples: one at the time of restart, another some time later, and one more a few minutes after that.

derekcollison commented 5 months ago

You can post here or send to neil@nats.io and derek@nats.io.

alexosumi commented 5 months ago

I sent the files by email, let me know if you need anything and I'll look for it.

What seems to be happening is that messages that enter the stream go into memory and remain there even after the ack, which is why it feels like a memory leak.

neilalexander commented 5 months ago

Thanks, I've got them. Taking a look now.

neilalexander commented 5 months ago

Can you please post updated stream info & consumer infos? Also, how many consumers like this are there?

It looks like the memory is mostly in newWaitQueue, have you increased the max_waiting on the consumers?
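
For anyone else hitting this: max_waiting shows up as "Max Waiting Pulls" in the consumer info output quoted earlier (512 is the server default there). A quick way to check it, with the stream/consumer names as placeholders:

    nats consumer info STREAM CONSUMER | grep -i "max waiting"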

alexosumi commented 5 months ago

Sorry about the slow response, I'm very sick.

I can't send all the information now, but the current max_waiting is 10000000, maybe that's the problem?

To answer Derek: my NATS cluster version is 2.10.12.

Thx

neilalexander commented 5 months ago

Please try lowering max waiting and let us know if that helps. Setting it that high is not recommended and is typically an indication of a design issue somewhere.

Hope you feel better soon!

alexosumi commented 4 months ago

Hi, last week I changed some streams to reduce max_waiting from 10,000,000 to 200 (currently we are using around 100 consumers in parallel), but memory consumption keeps growing. In the next few days I will change all streams and post the result here.

alexosumi commented 4 months ago

Hi, I reduced all max_waiting of all streams from 10,000,000 to 200, and the memory keeps growing, but slower than before.

I am sending 2 files for analysis, to see whether the problem is still max_waiting.

thx

dawson0001 commented 3 months ago

Hi, I also see what looks like a memory leak in NATS JetStream. My streams all use Interest retention, we have only one consumer per stream, and the events are acked immediately after being processed. NATS server version: 2.10.16-alpine. I attach some metrics from NATS JetStream and the memory growth below:

When I tried restarting only one instance and kept the other two instances running, that instance's memory came down, but the other two instances still kept high memory usage.

dawson0001 commented 3 months ago

I think this is a very critical issue; we should consider changing the label of this issue to BUG and giving it high priority. Thanks team so much :pray:

fyi @derekcollison

neilalexander commented 3 months ago

Please enable profiling and capture some memory profiles when the memory has built up a bit: https://docs.nats.io/running-a-nats-service/nats_admin/profiling

alexosumi commented 3 months ago

I changed GOMEMLIMIT from 6 GB to 1 GB, and I think it's working better.

(screenshot: memory usage graph)

dawson0001 commented 3 months ago

> Please enable profiling and capture some memory profiles when the memory has built up a bit: https://docs.nats.io/running-a-nats-service/nats_admin/profiling

Hi, here are the nats-js status and allocs files from all 3 nodes over the last 17 hours. FYI, each nats-js node has these specs:

    requests: cpu 300m, memory 256Mi
    limits:   cpu 500m, memory 512Mi
    Environment: k8s on AWS
    GOMEMLIMIT: 460MiB

(screenshots: JetStream metrics and node memory usage)

nats-js-cluster-allocs.zip

Thanks team 🙏

dawson0001 commented 3 months ago

(screenshot: inuse_space heap profile)

This is inuse_space today. The inuse_space is small, but the server memory has already increased to 326 MB.

# runtime.MemStats
# Alloc = 21065048
# TotalAlloc = 288041678560
# Sys = 57431304
# Lookups = 0
# Mallocs = 2252233889
# Frees = 2252023063
# HeapAlloc = 21065048
# HeapSys = 46923776
# HeapIdle = 18825216
# HeapInuse = 28098560
# HeapReleased = 13516800
# HeapObjects = 210826
# Stack = 3407872 / 3407872
# MSpan = 484320 / 587520
# MCache = 1200 / 15600
# BuckHashSys = 1937663
# GCSys = 3883744
# OtherSys = 675129
# NextGC = 27944760
# LastGC = 1716890496045862234
# PauseNs = [51226 53660 57724 65182 64454 69617 76024 56287 60304 71258 69170 86427 71512 44464 51309 57699 178902 62415 56135 68709 86163 73617 87702 54692 76050 104977 65781 59509 49757 93795 53392 85926 69470 54411 181307 58013 66676 72017 64264 69666 55291 69169 81289 72059 60914 49600 49284 69559 66397 219800 60134 51457 66197 61029 62170 76466 120663 62790 64227 70712 53860 47799 52295 60017 62708 66090 55417 48453 50776 60232 137672 63418 51579 45764 86686 49380 65855 269368 53906 71689 118587 76682 246840 71419 57226 87813 50206 62396 154376 54439 51296 59278 54004 63876 58924 54780 75354 52164 61785 62203 56530 67573 50581 42438 189279 75566 50604 51338 666805 69683 49049 69270 71748 62662 66535 63406 107027 60499 57895 60611 443006 115589 62602 51945 87341 193104 68926 653121 60323 70287 65872 52397 68416 57140 65025 68151 67210 1607469 63657 76532 56013 55066 61387 60526 136303 62647 43822 63585 65483 73029 62448 48897 84801 62323 121557 1050655 60257 69344 65768 53472 58070 45558 67622 115450 51438 70911 73859 55582 62205 70713 113979 79017 60572 48838 111414 136271 95611 374600 63317 99106 57448 61304 62340 65541 62024 63408 52555 56456 62151 49892 51016 53714 45091 60313 80776 53714 64281 69211 58887 61006 63280 60275 74510 67050 65630 59734 62974 76718 53340 65193 68117 52706 63407 61475 84993 188616 71842 149572 53202 58860 49914 51101 83548 62713 73304 60826 143349 52155 66421 66919 66859 84313 62352 48506 69013 56692 310756 60656 74842 73098 60476 65313 74974 85324 63610 61777 58284 51497 59223 108613 44427 60901 64446 61186 71411 63997]
# PauseEnd = [1716885302111695050 1716885326773991492 1716885348493389371 1716885374274948217 1716885399459598222 1716885423183850254 1716885446169218278 1716885469284014372 1716885495444168342 1716885520248148223 1716885546638590190 1716885571174082628 1716885595564356169 1716885618071149349 1716885641004187587 1716885663000875955 1716885686787486928 1716885712337964636 1716885737361138061 1716885765711418873 1716885792453103660 1716885820392127085 1716885842953873187 1716885867519053648 1716885889976660049 1716885913317300760 1716885937375160553 1716885961977536780 1716885988110703680 1716886008876131875 1716886035955013989 1716886060667488767 1716886085170182092 1716886110073982603 1716886133846303740 1716886157573773784 1716886181748586536 1716886203754968318 1716886229994795938 1716886253386890042 1716886278626649330 1716886304842271474 1716886331463788919 1716886358666325979 1716886383085060067 1716886407895559483 1716886431155256685 1716886456159109064 1716886483178442131 1716886507799166739 1716886529296091809 1716886552868309380 1716886577591057332 1716886603334591775 1716886625151483664 1716886651190284724 1716886675877919012 1716886698387424067 1716886721142957141 1716886744350299106 1716886767618680899 1716886790149261895 1716886812478394600 1716886836855202438 1716886858269772327 1716886879774082532 1716886904981653289 1716886929435595669 1716886957044461390 1716886983830305053 1716887008981754143 1716887031173478795 1716887057581308977 1716887080851490120 1716887102139825901 1716887124109703319 1716887147872000879 1716887172405733983 1716887197032667309 1716887219077513466 1716887242805323421 1716887267186434048 1716887292469874649 1716887316101221334 1716887341606593648 1716887369128154463 1716887391569935568 1716887414570534530 1716887436837759207 1716887460973972515 1716887483453591824 1716887509577301890 1716887531276832257 1716887554942449440 1716887578162940672 1716887602753954743 1716887628595066549 1716887654266742733 1716887675988493852 1716887697463257612 1716887719673958749 1716887743744952751 1716887768548640891 1716887791460874324 1716887817222746925 1716887840250070094 1716887862260310079 1716887886851803015 1716887913247628327 1716887941163314458 1716887968846685311 1716887993603353153 1716888016147723645 1716888037524818213 1716888060613307550 1716888080610653043 1716888103734662326 1716888124937010897 1716888152043128441 1716888179544308615 1716888203564183269 1716888226200235770 1716888250271158903 1716888276558805082 1716888298957091274 1716888323231245648 1716888347659060522 1716888372057031645 1716888393448619390 1716888420196358208 1716888444492899262 1716888468691357378 1716888494645140890 1716888516720691502 1716888543294494394 1716888570172201480 1716888591993776698 1716888612994891638 1716888635482684332 1716888658808062521 1716888683186578039 1716888710270896456 1716888734721436794 1716888758757630128 1716888783940945556 1716888809630164799 1716888834160302373 1716888858986701695 1716888884434496945 1716888907282324763 1716888931303296483 1716888956316451604 1716888981384931129 1716889006166734097 1716889031504286742 1716889056578438126 1716889079474677319 1716889102844980809 1716889126159046981 1716889151517803164 1716889176513956841 1716889199622438373 1716889221442101185 1716889246148869565 1716889269492835256 1716889293158557145 1716889319497811204 1716889346183939709 1716889372714983718 1716889398295740307 1716889424947204876 1716889450197144384 1716889474672146396 1716889498609009061 1716889521142992460 1716889546243501679 1716889570296735978 
1716889592464411199 1716889617393166902 1716889639548085189 1716889662966114397 1716889685365000066 1716889709369901024 1716889725410322951 1716889748400811931 1716889771647563528 1716889794059040787 1716889818092792728 1716889843227017999 1716889869488622665 1716889893947587164 1716889918751037872 1716889939662831010 1716889964811052734 1716889987238517072 1716890011161651527 1716890034877824247 1716890057523752180 1716890083381846546 1716890107719692556 1716890131979270986 1716890157611585627 1716890182888870620 1716890209366693486 1716890236150234827 1716890257225052902 1716890280381239158 1716890303078766043 1716890327180382758 1716890353349376498 1716890377518994936 1716890398158212668 1716890422349025312 1716890445650850448 1716890469885040412 1716890496045862234 1716884356151825897 1716884381879022307 1716884404679435506 1716884429151304528 1716884449496906532 1716884472222230934 1716884495967161803 1716884519664090834 1716884543895586148 1716884570047335235 1716884596561790567 1716884622546548349 1716884648071719401 1716884674207199561 1716884691063738943 1716884715743970608 1716884736797763362 1716884759257269711 1716884783225275753 1716884810136055799 1716884836166060199 1716884856781708825 1716884879881421548 1716884901179317166 1716884923745871930 1716884943919195639 1716884971170821449 1716884993958334065 1716885016168318839 1716885038918397605 1716885060239892741 1716885083636242458 1716885107875738683 1716885133704216086 1716885157959164395 1716885182230818378 1716885206924618424 1716885232378620084 1716885256164441221 1716885281602430766]
# NumGC = 23000
# NumForcedGC = 0
# GCCPUFraction = 0.00048738736593957415
# DebugGC = false
# MaxRSS = 63049728

dawson0001 commented 3 months ago

Here are the heap and allocs profiles from today.

heap and allocs.zip

datphan310 commented 3 months ago

Hi team, @derekcollison @neilalexander, can we take a look at this post? I think we may find some good information here. From what I have read, the cause in that post was building Go with the "-race" flag, but I'm not sure whether that is also our reason.

https://forum.golangbridge.org/t/golang-service-running-on-kubernetes-eks-gets-oom-killed-high-res-memory-value-low-runtime-memstats-alloc-value/34541/11
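
If anyone wants to rule that theory out: for binaries built with Go 1.18+, the embedded build settings can be inspected; a "build -race=true" line only appears if the race detector was enabled. The binary path inside the container is an assumption here:

    # no output means the binary was not built with -race
    go version -m /usr/local/bin/nats-server | grep race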

alexosumi commented 3 months ago

@dawson0001 whenever you can, do not set resources.requests.cpu, and set resources.requests.memory to the same value as resources.limits.memory (in your case 512Mi). Looking at your screenshots I can't see a problem: you only have a problem if your server's memory usage gets near GOMEMLIMIT and the CPU load becomes high (greater than your number of CPU threads; if your machine has 2 vCPUs, a load of 2 is natural and you won't have a serious problem).

About my environment: after deploying release 2.10.14, the issue was resolved.

michaelmohamed commented 2 months ago

I think I am experiencing the same issue here: https://github.com/nats-io/nats-server/issues/5596

arkh-consensys commented 1 month ago

We also see higher memory use than what NATS recommends in its resource guidance for JetStream.

We have a stream with 3 cluster nodes, 1 publisher, and 1 pull consumer. Message rate is between 500 and 700 msg/s. Memory usage shows a slight upward slope, which hints to me that it could be a memory leak. We use NATS 2.10.18-alpine.

(screenshots: memory usage graphs, 2024-07-23)

We also observed improvements after reducing GOMEMLIMIT from 3 to 2 GiB. Furthermore, whenever the NATS storage reaches 80% of its capacity, we observe poor message delivery performance and a lot of NATS timeout errors on the publisher.