nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Memory skyrockets after `nats stream view` #3499

Closed - Kaarel closed this issue 1 year ago

Kaarel commented 1 year ago

nats-server: 2.9.1, nats CLI: 0.0.34, 3-node cluster on AWS EC2 (Ubuntu)

Load before: (screenshot)

Load after nats stream view: (screenshot)

After a few seconds the CPU returned to its previous level. After a few minutes, memory did too.

If you're curious about the max memory limit, it's because previously (and with the previous nats-server version) these spikes exhausted the entire node's memory, causing a node outage. Setting this limit doesn't seem to have affected the normal workings of nats-server.
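
As an aside, besides top the spike can also be watched from the server's own monitoring endpoint. A minimal Go sketch, assuming the HTTP monitoring port is enabled on 8222; the host and poll interval are placeholders:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// varz holds just the fields we care about from the /varz monitoring endpoint.
type varz struct {
	Mem int64   `json:"mem"` // resident memory in bytes
	CPU float64 `json:"cpu"` // CPU usage in percent
}

func main() {
	// Assumes the server was started with monitoring enabled,
	// e.g. "http: 8222" in the server config; adjust the host as needed.
	const url = "http://127.0.0.1:8222/varz"

	for {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("varz request failed:", err)
		} else {
			var v varz
			if err := json.NewDecoder(resp.Body).Decode(&v); err == nil {
				fmt.Printf("mem=%d MiB cpu=%.1f%%\n", v.Mem/(1024*1024), v.CPU)
			}
			resp.Body.Close()
		}
		time.Sleep(2 * time.Second)
	}
}
```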

derekcollison commented 1 year ago

What does nats stream info <STREAM> for the stream in question look like?

derekcollison commented 1 year ago

Any updates?

Kaarel commented 1 year ago

> What does nats stream info <STREAM> for the stream in question look like?

nats stream info logs
Information for Stream logs created 2022-07-13 20:07:45

             Subjects: logs.>
             Replicas: 3
              Storage: File

Options:

            Retention: Interest
     Acknowledgements: true
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: 1.0 GiB
          Maximum Age: 5d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: nats-cluster
               Leader: n3
              Replica: n1, current, seen 0.02s ago
              Replica: n2, current, seen 0.02s ago

State:

             Messages: 0
                Bytes: 0 B
             FirstSeq: 131,100,079
              LastSeq: 131,100,078 @ 2022-11-04T12:50:01 UTC
     Active Consumers: 2
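
For reference, a stream with this configuration could be declared from the Go client (nats.go) roughly as in the sketch below; the connection URL is a placeholder, and the limits mirror the info output above:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://n1:4222") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Mirrors the "nats stream info logs" output above.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:       "logs",
		Subjects:   []string{"logs.>"},
		Retention:  nats.InterestPolicy,
		Storage:    nats.FileStorage,
		Replicas:   3,
		Discard:    nats.DiscardOld,
		MaxBytes:   1 << 30,            // 1.0 GiB
		MaxAge:     5 * 24 * time.Hour, // 5d0h0m0s
		Duplicates: 2 * time.Minute,    // duplicate window
	})
	if err != nil {
		log.Fatal(err)
	}
}
```
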
ripienaar commented 1 year ago

Can you show the consumer list there please? Stream view makes a consumer, and on recent CLI versions with recent broker versions it will auto clean up - but on earlier versions there's a chance that, on a non-clean exit, it would stay around and consume some resources.

Kaarel commented 1 year ago

> Can you show the consumer list there please? Stream view makes a consumer, and on recent CLI versions with recent broker versions it will auto clean up - but on earlier versions there's a chance that, on a non-clean exit, it would stay around and consume some resources.

This is quite a bit after nats stream view logs:

nats consumer report logs
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                          Consumer report for logs with 2 consumers                                          │
├──────────────────┬──────┬────────────┬──────────┬─────────────┬─────────────┬─────────────┬─────────────┬───────────────────┤
│ Consumer         │ Mode │ Ack Policy │ Ack Wait │ Ack Pending │ Redelivered │ Unprocessed │ Ack Floor   │ Cluster           │
├──────────────────┼──────┼────────────┼──────────┼─────────────┼─────────────┼─────────────┼─────────────┼───────────────────┤
│ c1               │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 131,170,411 │ n3, n2, n1*       │
│ c2               │ Push │ Explicit   │ 30.00s   │ 0           │ 0           │ 0           │ 131,170,411 │ n3*, n2, n1       │
╰──────────────────┴──────┴────────────┴──────────┴─────────────┴─────────────┴─────────────┴─────────────┴───────────────────╯
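
For completeness, a push consumer with the settings shown in this report (explicit ack, 30s ack wait) could be created roughly as follows; the durable name, deliver subject, and handler body are assumptions, not details from this thread:

```go
package consumers

import (
	"time"

	"github.com/nats-io/nats.go"
)

// SetupLogsConsumer creates a push consumer matching the report above
// (explicit ack, 30s ack wait) and subscribes to its deliveries.
// The durable name and deliver subject are placeholders.
func SetupLogsConsumer(nc *nats.Conn, js nats.JetStreamContext) error {
	_, err := js.AddConsumer("logs", &nats.ConsumerConfig{
		Durable:        "c1",
		DeliverSubject: "deliver.c1",
		AckPolicy:      nats.AckExplicitPolicy,
		AckWait:        30 * time.Second,
	})
	if err != nil {
		return err
	}
	// With explicit ack, the subscriber has to acknowledge each message.
	_, err = nc.Subscribe("deliver.c1", func(m *nats.Msg) {
		// ... persist the message externally, then ack it.
		m.Ack()
	})
	return err
}
```
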
Kaarel commented 1 year ago

> Any updates?

The cluster has been running for a while. Btw the reason we do nats stream view every once in a while is that we get dangling messages on our streams, and that command seems to clear them out. Although with the above stream config (Retention: Interest), once a message is consumed it should be removed from the stream, and that is what seems to happen 99.999% of the time. We have a consumer that stores all messages to external persistent storage, so we shouldn't really have any dangling messages. Perhaps that's why nats stream view clears out the messages - because they are not really messages, more like phantoms? Just a random stupid thought :) Over time the stream message count is fairly stable, presumably because the retention limit keeps dropping older messages.
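
To illustrate the expectation above: with interest retention, a message that has been delivered and acked should disappear from the stream. A minimal sketch, reusing the stream and consumer from the earlier sketches and a crude sleep instead of proper synchronization:

```go
package consumers

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

// CheckInterestRetention publishes one message and then confirms it has been
// removed from the interest-based stream once the consumer has acked it.
func CheckInterestRetention(js nats.JetStreamContext) error {
	if _, err := js.Publish("logs.demo", []byte("hello")); err != nil {
		return err
	}
	time.Sleep(time.Second) // give the consumer time to receive and ack

	si, err := js.StreamInfo("logs")
	if err != nil {
		return err
	}
	fmt.Printf("messages left in stream: %d\n", si.State.Msgs) // expected: 0
	return nil
}
```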

In any case, this is what just happened. nats stream report showed two streams with message count > 0; one of them was logs. I issued nats stream view logs. When the CLI asked if I wanted to see the next page, I chose No. This cleared the logs stream, but surprisingly the other stream as well!!

I issued the command on node n1, but CPU and memory spiked on node n2. I did not snapshot node n2 beforehand as I didn't expect it to be affected. After the fact I noticed that it had multiple nats-server processes in top, as in my original post, while the other two nodes in the cluster reported just a single nats-server process. Sorry, that's a false flag - it turned out to be just an htop setting.

Also, before issuing nats stream view logs none of the nodes reported any issues in nats stream report; that is, no replicas had an exclamation mark (!) behind them. Immediately after nats stream view logs, replicas of a number of different streams had ! behind them. After a few minutes, when the cluster "cooled down" (CPU and memory back to previous levels), only two streams still had ! behind their replicas. None of the streams that now have ! had any CLI commands issued against them.

🤷

ripienaar commented 1 year ago

Thanks - so those aren't stream view consumers; I just wanted to make sure nothing was lingering.

Kaarel commented 1 year ago

Btw the exclamation marks in nats stream report disappeared after issuing nats stream cluster step-down twice (the first time failed):

nats stream cluster step-down stream1
13:49:44 Requesting leader step down of "n1" in a 3 peer RAFT group
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
nats: error: stream did not elect a new leader in time
nats stream cluster step-down stream1
13:49:56 Requesting leader step down of "n3" in a 3 peer RAFT group
13:49:56 No leader elected
13:49:56 No leader elected
13:49:56 New leader elected "n1"

Information for Stream stream1 created 2022-10-11 13:23:27

             Subjects: subject1
             Replicas: 3
              Storage: File

Options:

            Retention: Interest
     Acknowledgements: true
       Discard Policy: Old
     Duplicate Window: 2m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: 1.0 GiB
          Maximum Age: 5d0h0m0s
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: nats-cluster
               Leader: n1
              Replica: n3, outdated, seen 0.01s ago, 3 operations behind
              Replica: n2, outdated, seen 0.00s ago, 1,190 operations behind

State:

             Messages: 0
                Bytes: 0 B
             FirstSeq: 1,161
              LastSeq: 1,160 @ 2022-11-04T13:02:11 UTC
     Active Consumers: 1

Note that both replicas are outdated. nats stream info stream1 soon after reports both replicas as current.
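
The step-downs above were issued with the CLI; under the hood this maps to a JetStream API request on the subject $JS.API.STREAM.LEADER.STEPDOWN.<stream>, so the same thing can be triggered from a client if needed. A rough sketch, with a placeholder connection URL and simplified response handling:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://n1:4222") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Ask the current leader of "stream1" to step down; the RAFT group
	// then holds an election among the remaining peers.
	resp, err := nc.Request("$JS.API.STREAM.LEADER.STEPDOWN.stream1", nil, 5*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(resp.Data)) // raw JSON response from the server
}
```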

ripienaar commented 1 year ago

The first time you ran the step-down it did change the leadership - but a bug in the nats CLI made it wait only a very short period for the changeover. I've fixed that bug.

Kaarel commented 1 year ago

Eventually got all the stream message counts to 0, as expected with our stream configurations (Retention: Interest), and nats stream report looked good. Proceeded to upgrade the 2.9.1 nodes to 2.9.5 one by one. Now nats stream report shows 177 messages again on our busiest stream. Issued nats stream view logs and got the same behaviour. The following few minutes produced this log:

[12868] 2022/11/04 14:16:25.607777 [INF] JetStream cluster new consumer leader for 'USERS > stream5 > consumer5'
[12868] 2022/11/04 14:16:25.874025 [INF] JetStream cluster new stream leader for 'USERS > stream2'
[12868] 2022/11/04 14:16:26.302867 [INF] JetStream cluster new consumer leader for 'USERS > stream9 > consumer9'
[12868] 2022/11/04 14:16:26.553335 [INF] JetStream cluster new consumer leader for 'USERS > stream7 > consumer7'
[12868] 2022/11/04 14:16:39.528965 [WRN] 172.17.0.2:35294 - cid:131 - "v1.4.12:node:gi" - Readloop processing time: 8.444347532s
[12868] 2022/11/04 14:16:39.547047 [INF] JetStream cluster new stream leader for 'USERS > stream6'
[12868] 2022/11/04 14:16:39.548374 [INF] JetStream cluster new consumer leader for 'USERS > stream1 > consumer1'
[12868] 2022/11/04 14:16:39.552887 [INF] JetStream cluster new stream leader for 'USERS > stream3'
[12868] 2022/11/04 14:16:39.555044 [INF] JetStream cluster new stream leader for 'USERS > stream9'
[12868] 2022/11/04 14:16:39.556856 [INF] JetStream cluster new consumer leader for 'USERS > stream10 > consumer10'
[12868] 2022/11/04 14:16:39.563998 [INF] JetStream cluster new stream leader for 'USERS > stream4'
[12868] 2022/11/04 14:16:39.568246 [INF] JetStream cluster new stream leader for 'USERS > stream10'
[12868] 2022/11/04 14:16:39.570201 [INF] JetStream cluster new consumer leader for 'USERS > stream6 > consumer6'
[12868] 2022/11/04 14:16:39.571888 [INF] JetStream cluster new consumer leader for 'USERS > stream2 > consumer2'
[12868] 2022/11/04 14:16:39.575349 [INF] JetStream cluster new consumer leader for 'USERS > stream4 > consumer1'
[12868] 2022/11/04 14:16:39.579897 [INF] JetStream cluster new stream leader for 'USERS > stream7'
[12868] 2022/11/04 14:16:41.013990 [INF] JetStream cluster new consumer leader for 'USERS > logs > consumer_logs_2'
[12868] 2022/11/04 14:16:41.029561 [INF] JetStream cluster new stream leader for 'USERS > logs'
[12868] 2022/11/04 14:16:45.752661 [INF] JetStream cluster new stream leader for 'USERS > stream5'
[12868] 2022/11/04 14:16:46.569608 [INF] Self is new JetStream cluster metadata leader
[12868] 2022/11/04 14:16:46.630000 [INF] JetStream cluster new consumer leader for 'USERS > stream9 > consumer9'
[12868] 2022/11/04 14:16:47.977539 [INF] JetStream cluster new consumer leader for 'USERS > stream1 > consumer1'
[12868] 2022/11/04 14:16:50.084652 [INF] JetStream cluster new stream leader for 'USERS > stream1'
[12868] 2022/11/04 14:16:51.488991 [INF] JetStream cluster new stream leader for 'USERS > stream6'
[12868] 2022/11/04 14:16:52.705926 [INF] JetStream cluster new consumer leader for 'USERS > stream5 > consumer5'
[12868] 2022/11/04 14:17:01.409274 [INF] 172.31.13.117:48934 - rid:143 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 1 chunks of 96 total bytes.
[12868] 2022/11/04 14:17:01.409299 [INF] 172.31.13.117:48934 - rid:143 - Router connection closed: Slow Consumer (Write Deadline)
[12868] 2022/11/04 14:17:02.491665 [INF] 172.31.13.117:6222 - rid:367 - Route connection created
[12868] 2022/11/04 14:17:17.581025 [INF] JetStream cluster no metadata leader
[12868] 2022/11/04 14:17:18.555363 [INF] 172.31.13.117:54358 - rid:369 - Route connection created
[12868] 2022/11/04 14:17:18.592504 [INF] 172.31.13.117:54358 - rid:369 - Router connection closed: Duplicate Route
[12868] 2022/11/04 14:17:21.109839 [INF] JetStream cluster new consumer leader for 'USERS > stream3 > consumer3'
[12868] 2022/11/04 14:17:21.768746 [INF] JetStream cluster new consumer leader for 'USERS > stream4 > consumer1'
[12868] 2022/11/04 14:17:22.821123 [INF] JetStream cluster new stream leader for 'USERS > stream4'
[12868] 2022/11/04 14:17:28.617775 [INF] JetStream cluster new consumer leader for 'USERS > stream1 > consumer1'
[12868] 2022/11/04 14:17:29.156500 [INF] JetStream cluster new consumer leader for 'USERS > stream3 > consumer3'
[12868] 2022/11/04 14:17:29.700658 [INF] JetStream cluster new stream leader for 'USERS > logs'
[12868] 2022/11/04 14:17:30.392949 [INF] JetStream cluster new consumer leader for 'USERS > logs > consumer_logs_2'
[12868] 2022/11/04 14:17:30.393932 [WRN] Failed to install snapshot for 'USERS > logs > consumer_logs_2' [C-R3F-KWx8T2ky]: raft: snapshot can not be installed while catchups running
[12868] 2022/11/04 14:17:30.841793 [INF] JetStream cluster new stream leader for 'USERS > stream2'
[12868] 2022/11/04 14:17:36.915820 [INF] JetStream cluster new metadata leader: n3/nats-cluster

Had to run nats stream cluster step-down on some of the streams that had ! behind their replicas for nats stream report to look good again.

derekcollison commented 1 year ago

What does nats traffic show? Do you see many votes? That usually points to instability due to head-of-line blocking, which can be caused by network saturation.

Kaarel commented 1 year ago

It seems the fact that there are "dangling" messages in the stream is irrelevant and is a completely separate issue. The CPU and memory spike happens even when nats stream report shows 0 messages in all streams.

1. Before nats stream view some-stream

(screenshot)

NATS load on all three nodes of the cluster: (screenshot)

2. Immediately after nats stream view some-stream

(screenshot)

NATS load on all three nodes of the cluster: (screenshot)

After a few minutes, NATS load on all three nodes returns to pre-nats stream view levels. However, often (though not always) some stream replicas have an exclamation mark behind them in nats stream report. After issuing nats stream cluster step-down on every affected stream, the exclamation marks are cleared. Here's nats traffic after issuing nats stream cluster step-down on a bunch of streams:

(screenshot)

derekcollison commented 1 year ago

And for the stream in question, what does nats stream info <stream> return?

derekcollison commented 1 year ago

And the stream before the view command has no messages, since they have all been acked and removed (interest-based stream), correct?

derekcollison commented 1 year ago

I am trying to recreate this, but since the stream itself is small, would it be possible to snapshot the stream and consumers so that we can recreate it in our environment?

derekcollison commented 1 year ago

In the setup you described I can see a CPU spike, with no messages in the stream but very high sequence numbers.

derekcollison commented 1 year ago

Will investigate more.

derekcollison commented 1 year ago

Thanks so much for your patience. I found it and it is fixed in the PR above. Once merged it will be part of the nightly builds as well.

derekcollison commented 1 year ago

Will be in nightly this evening. Was merged to both main and dev branches.

Kaarel commented 1 year ago

When using synadia/nats-server:nightly-20221107 (Git: [4e8d1ae2]) I haven't been able to cause memory or CPU spikes with nats stream view!

As a side effect, nats stream view no longer clears out messages in the stream. But I guess why we get seemingly dangling messages on our streams is another topic altogether.

derekcollison commented 1 year ago

Are you still seeing dangling messages appearing after moving to the nightly builds?

Kaarel commented 1 year ago

> Are you still seeing dangling messages appearing after moving to the nightly builds?

Correct. I've created a new issue with more details: https://github.com/nats-io/nats-server/issues/3612

derekcollison commented 1 year ago

Thanks, will take a look.