What does nats stream info <STREAM> for the stream in question look like?
Any updates?
nats stream info logs
Information for Stream logs created 2022-07-13 20:07:45
Subjects: logs.>
Replicas: 3
Storage: File
Options:
Retention: Interest
Acknowledgements: true
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: unlimited
Maximum Bytes: 1.0 GiB
Maximum Age: 5d0h0m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats-cluster
Leader: n3
Replica: n1, current, seen 0.02s ago
Replica: n2, current, seen 0.02s ago
State:
Messages: 0
Bytes: 0 B
FirstSeq: 131,100,079
LastSeq: 131,100,078 @ 2022-11-04T12:50:01 UTC
Active Consumers: 2
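For reference, the same stream could be declared from the Go client (github.com/nats-io/nats.go) roughly as below - a sketch built only from the config shown above; the connection URL is a placeholder:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // placeholder; real cluster URLs in practice
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Mirrors the `nats stream info logs` output: interest retention,
	// file storage, 3 replicas, 1 GiB / 5 day limits, 2m duplicate window.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:       "logs",
		Subjects:   []string{"logs.>"},
		Retention:  nats.InterestPolicy,
		Storage:    nats.FileStorage,
		Replicas:   3,
		Discard:    nats.DiscardOld,
		MaxBytes:   1 << 30,            // 1.0 GiB
		MaxAge:     5 * 24 * time.Hour, // 5d0h0m0s
		Duplicates: 2 * time.Minute,    // duplicate window
	}); err != nil {
		log.Fatal(err)
	}
}
```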
Could you show the consumer list there please? Stream view creates a consumer, and with recent CLI versions on recent broker versions it will auto clean up - but on earlier versions there's a chance that, on a non-clean exit, it would stay behind and consume some resources.
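One way to check for such leftovers programmatically - a sketch using the Go client, assuming a JetStreamContext js set up as in the snippet above (fmt import as usual):

```go
// List every consumer on the stream; a lingering stream-view consumer
// would show up as an ephemeral, i.e. with no Durable name set.
for ci := range js.Consumers("logs") {
	fmt.Printf("%s durable=%q pending=%d\n",
		ci.Name, ci.Config.Durable, ci.NumPending)
}
```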
This is quite a bit after running nats stream view logs:
nats consumer report logs
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Consumer report for logs with 2 consumers │
├──────────────────┬──────┬────────────┬──────────┬─────────────┬─────────────┬─────────────┬─────────────┬───────────────────┤
│ Consumer │ Mode │ Ack Policy │ Ack Wait │ Ack Pending │ Redelivered │ Unprocessed │ Ack Floor │ Cluster │
├──────────────────┼──────┼────────────┼──────────┼─────────────┼─────────────┼─────────────┼─────────────┼───────────────────┤
│ c1 │ Push │ Explicit │ 30.00s │ 0 │ 0 │ 0 │ 131,170,411 │ n3, n2, n1* │
│ c2 │ Push │ Explicit │ 30.00s │ 0 │ 0 │ 0 │ 131,170,411 │ n3*, n2, n1 │
╰──────────────────┴──────┴────────────┴──────────┴─────────────┴─────────────┴─────────────┴─────────────┴───────────────────╯
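A durable push consumer shaped like c1 above could be declared as follows - a sketch; the deliver subject is a made-up placeholder, and js is assumed set up as before:

```go
// Push mode (has a deliver subject), explicit acks, 30s ack wait -
// matching the consumer report above.
_, err := js.AddConsumer("logs", &nats.ConsumerConfig{
	Durable:        "c1",
	DeliverSubject: "deliver.logs.c1", // placeholder deliver subject
	AckPolicy:      nats.AckExplicitPolicy,
	AckWait:        30 * time.Second,
})
if err != nil {
	log.Fatal(err)
}
```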
Any updates?
The cluster has been running for a while. Btw, the reason we run nats stream view every once in a while is that we get dangling messages on our streams, and that command seems to clear them out. Although with the above stream config (Retention: Interest), once a message is consumed it should be removed from the stream - and this is what seems to happen 99.999% of the time. We have a consumer that stores all messages to external persistent storage, so there shouldn't really be any dangling messages. Perhaps that's why nats stream view clears out the messages - because they are not really messages, more like phantoms? Just a random stupid thought :) Over time the stream message count is fairly stable, presumably because the retention limit keeps dropping older messages.
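That is indeed how interest retention should behave; a small sketch to verify it in isolation (assumes the logs stream and a js context as above, and no other consumers with interest in the subject - the durable name and subject are illustrative):

```go
// With interest retention a message is removed from the stream once
// every consumer with interest in its subject has acked it.
sub, _ := js.Subscribe("logs.test", func(m *nats.Msg) {
	m.Ack()
}, nats.Durable("verify"), nats.ManualAck())
defer sub.Unsubscribe()

js.Publish("logs.test", []byte("hello"))
time.Sleep(time.Second) // crude: let delivery and the ack settle

si, _ := js.StreamInfo("logs")
fmt.Println("messages left:", si.State.Msgs) // expected: 0
```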
In any case, this is what just happened. nats stream report showed two streams with message count > 0; one of them was logs. I issued nats stream view logs. When the CLI asked if I wanted to see the next page I chose No. This cleared stream logs, but surprisingly the other stream as well!!
I issued the command on node n1 but CPU and memory spiked on node n2. I did not snapshot node n2 beforehand as I didn't expect it to be affected. But after the fact I noticed that it has multiple nats-server processes in top, as in my original post; the other two nodes in the cluster report just a single nats-server process in top. Sorry, that's a false flag - that turned out to be just an htop setting.
Also, before issuing nats stream view logs, none of the nodes reported any issues in nats stream report - that is, no replicas had an exclamation mark (!) behind them. Immediately after nats stream view logs, a bunch of replicas of a number of different streams had ! behind them. After a few minutes, when the cluster "cooled down" (CPU and memory back to previous levels), only two streams still had ! behind their replicas. None of the streams that now have ! had any CLI commands issued against them.
🤷
Thanks - so those aren't stream view consumers; just wanted to make sure nothing was lingering.
Btw, the exclamation marks in nats stream report disappeared after issuing nats stream cluster step-down twice (the first time failed):
nats stream cluster step-down stream1
13:49:44 Requesting leader step down of "n1" in a 3 peer RAFT group
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
13:49:45 No leader elected
nats: error: stream did not elect a new leader in time
nats stream cluster step-down stream1
13:49:56 Requesting leader step down of "n3" in a 3 peer RAFT group
13:49:56 No leader elected
13:49:56 No leader elected
13:49:56 New leader elected "n1"
Information for Stream stream1 created 2022-10-11 13:23:27
Subjects: subject1
Replicas: 3
Storage: File
Options:
Retention: Interest
Acknowledgements: true
Discard Policy: Old
Duplicate Window: 2m0s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: unlimited
Maximum Bytes: 1.0 GiB
Maximum Age: 5d0h0m0s
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: nats-cluster
Leader: n1
Replica: n3, outdated, seen 0.01s ago, 3 operations behind
Replica: n2, outdated, seen 0.00s ago, 1,190 operations behind
State:
Messages: 0
Bytes: 0 B
FirstSeq: 1,161
LastSeq: 1,160 @ 2022-11-04T13:02:11 UTC
Active Consumers: 1
Note that both replicas are outdated. nats stream info stream1 soon after reports both replicas as current.
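The current/outdated flags are also visible programmatically; a sketch of a small replica-lag check via the Go client (js assumed as before):

```go
si, _ := js.StreamInfo("stream1")
fmt.Println("leader:", si.Cluster.Leader)
for _, p := range si.Cluster.Replicas {
	// Matches the "outdated, N operations behind" lines above.
	if !p.Current || p.Lag > 0 {
		fmt.Printf("replica %s lagging: %d operations behind, seen %v ago\n",
			p.Name, p.Lag, p.Active)
	}
}
```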
The first time you ran step-down it did change the leadership - but a bug in the nats CLI made your CLI wait only a very short period for the change-over. I've fixed that bug.
Eventually got all the stream message counts to 0, as expected with our stream configurations (Retention: Interest), and nats stream report looked good. Proceeded to upgrade the 2.9.1 nodes to 2.9.5 one by one. Now nats stream report shows 177 messages again on our busiest stream. Issued nats stream view logs and saw the same behaviour. The following few minutes produced this log:
[12868] 2022/11/04 14:16:25.607777 [INF] JetStream cluster new consumer leader for 'USERS > stream5 > consumer5'
[12868] 2022/11/04 14:16:25.874025 [INF] JetStream cluster new stream leader for 'USERS > stream2'
[12868] 2022/11/04 14:16:26.302867 [INF] JetStream cluster new consumer leader for 'USERS > stream9 > consumer9'
[12868] 2022/11/04 14:16:26.553335 [INF] JetStream cluster new consumer leader for 'USERS > stream7 > consumer7'
[12868] 2022/11/04 14:16:39.528965 [WRN] 172.17.0.2:35294 - cid:131 - "v1.4.12:node:gi" - Readloop processing time: 8.444347532s
[12868] 2022/11/04 14:16:39.547047 [INF] JetStream cluster new stream leader for 'USERS > stream6'
[12868] 2022/11/04 14:16:39.548374 [INF] JetStream cluster new consumer leader for 'USERS > stream1 > consumer1'
[12868] 2022/11/04 14:16:39.552887 [INF] JetStream cluster new stream leader for 'USERS > stream3'
[12868] 2022/11/04 14:16:39.555044 [INF] JetStream cluster new stream leader for 'USERS > stream9'
[12868] 2022/11/04 14:16:39.556856 [INF] JetStream cluster new consumer leader for 'USERS > stream10 > consumer10'
[12868] 2022/11/04 14:16:39.563998 [INF] JetStream cluster new stream leader for 'USERS > stream4'
[12868] 2022/11/04 14:16:39.568246 [INF] JetStream cluster new stream leader for 'USERS > stream10'
[12868] 2022/11/04 14:16:39.570201 [INF] JetStream cluster new consumer leader for 'USERS > stream6 > consumer6'
[12868] 2022/11/04 14:16:39.571888 [INF] JetStream cluster new consumer leader for 'USERS > stream2 > consumer2'
[12868] 2022/11/04 14:16:39.575349 [INF] JetStream cluster new consumer leader for 'USERS > stream4 > consumer1'
[12868] 2022/11/04 14:16:39.579897 [INF] JetStream cluster new stream leader for 'USERS > stream7'
[12868] 2022/11/04 14:16:41.013990 [INF] JetStream cluster new consumer leader for 'USERS > logs > consumer_logs_2'
[12868] 2022/11/04 14:16:41.029561 [INF] JetStream cluster new stream leader for 'USERS > logs'
[12868] 2022/11/04 14:16:45.752661 [INF] JetStream cluster new stream leader for 'USERS > stream5'
[12868] 2022/11/04 14:16:46.569608 [INF] Self is new JetStream cluster metadata leader
[12868] 2022/11/04 14:16:46.630000 [INF] JetStream cluster new consumer leader for 'USERS > stream9 > consumer9'
[12868] 2022/11/04 14:16:47.977539 [INF] JetStream cluster new consumer leader for 'USERS > stream1 > consumer1'
[12868] 2022/11/04 14:16:50.084652 [INF] JetStream cluster new stream leader for 'USERS > stream1'
[12868] 2022/11/04 14:16:51.488991 [INF] JetStream cluster new stream leader for 'USERS > stream6'
[12868] 2022/11/04 14:16:52.705926 [INF] JetStream cluster new consumer leader for 'USERS > stream5 > consumer5'
[12868] 2022/11/04 14:17:01.409274 [INF] 172.31.13.117:48934 - rid:143 - Slow Consumer Detected: WriteDeadline of 10s exceeded with 1 chunks of 96 total bytes.
[12868] 2022/11/04 14:17:01.409299 [INF] 172.31.13.117:48934 - rid:143 - Router connection closed: Slow Consumer (Write Deadline)
[12868] 2022/11/04 14:17:02.491665 [INF] 172.31.13.117:6222 - rid:367 - Route connection created
[12868] 2022/11/04 14:17:17.581025 [INF] JetStream cluster no metadata leader
[12868] 2022/11/04 14:17:18.555363 [INF] 172.31.13.117:54358 - rid:369 - Route connection created
[12868] 2022/11/04 14:17:18.592504 [INF] 172.31.13.117:54358 - rid:369 - Router connection closed: Duplicate Route
[12868] 2022/11/04 14:17:21.109839 [INF] JetStream cluster new consumer leader for 'USERS > stream3 > consumer3'
[12868] 2022/11/04 14:17:21.768746 [INF] JetStream cluster new consumer leader for 'USERS > stream4 > consumer1'
[12868] 2022/11/04 14:17:22.821123 [INF] JetStream cluster new stream leader for 'USERS > stream4'
[12868] 2022/11/04 14:17:28.617775 [INF] JetStream cluster new consumer leader for 'USERS > stream1 > consumer1'
[12868] 2022/11/04 14:17:29.156500 [INF] JetStream cluster new consumer leader for 'USERS > stream3 > consumer3'
[12868] 2022/11/04 14:17:29.700658 [INF] JetStream cluster new stream leader for 'USERS > logs'
[12868] 2022/11/04 14:17:30.392949 [INF] JetStream cluster new consumer leader for 'USERS > logs > consumer_logs_2'
[12868] 2022/11/04 14:17:30.393932 [WRN] Failed to install snapshot for 'USERS > logs > consumer_logs_2' [C-R3F-KWx8T2ky]: raft: snapshot can not be installed while catchups running
[12868] 2022/11/04 14:17:30.841793 [INF] JetStream cluster new stream leader for 'USERS > stream2'
[12868] 2022/11/04 14:17:36.915820 [INF] JetStream cluster new metadata leader: n3/nats-cluster
Had to run nats stream cluster step-down on some of the streams that had ! behind their replicas for nats stream report to look good again.
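For what it's worth, the CLI's step-down is a thin wrapper over a JetStream API request; a raw-request sketch from the Go client (nc assumed connected as before):

```go
// Ask the current leader of stream1's RAFT group to step down.
resp, err := nc.Request("$JS.API.STREAM.LEADER.STEPDOWN.stream1", nil, 5*time.Second)
if err != nil {
	log.Fatal(err)
}
fmt.Println(string(resp.Data)) // JSON response indicating success or error
```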
What does nats traffic show? Do you see many votes? That usually points to instability due to head-of-line blocking, which can be caused by network saturation.
Seems the fact that there are "dangling" messages in the stream is irrelevant and a completely separate issue. The CPU and memory spike happens even when nats stream report shows 0 messages in all streams.
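A per-stream message-count check equivalent to what nats stream report shows - a sketch, js assumed as before:

```go
// Walk every stream and print its count and sequence range; note that a
// stream reported "empty" here can still carry very high sequences.
for si := range js.StreamsInfo() {
	fmt.Printf("%-12s msgs=%d first=%d last=%d\n",
		si.Config.Name, si.State.Msgs, si.State.FirstSeq, si.State.LastSeq)
}
```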
1. Before nats stream view some-stream, NATS load on all three nodes of the cluster:
2. Immediately after nats stream view some-stream, NATS load on all three nodes of the cluster:
After a few minutes, NATS load on all three nodes returns to pre-nats stream view levels. However, often though not always, some stream replicas have an exclamation mark behind them in nats stream report. After issuing nats stream cluster step-down on every affected stream, the exclamation marks are cleared. Here's nats traffic after issuing nats stream cluster step-down on a bunch of streams:
And for the stream in question, what does nats stream info <stream> return?
And before the view command the stream has no messages, since they have all been acked and removed (interest-based stream), correct?
I am trying to recreate this, but since the stream itself is small, would it be possible to snapshot the stream and its consumers so we can recreate it in our environment?
I can see a CPU spike, with no messages in the stream but very high sequence numbers, in the setup you described.
Will investigate more.
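A crude approximation of that setup from the Go client - drive the sequences high on an interest stream that an acking consumer keeps empty, then read it back with an ephemeral ordered reader (that stream view reads via an ordered ephemeral consumer is an assumption here; the subject and durable names are illustrative):

```go
// Keep the stream empty while pushing sequences up.
sink, _ := js.Subscribe("logs.>", func(m *nats.Msg) { m.Ack() },
	nats.Durable("sink"), nats.ManualAck())
defer sink.Unsubscribe()

for i := 0; i < 100_000; i++ {
	js.Publish("logs.load", []byte("x")) // sync publish keeps the sketch simple
}

// Page through like stream view; on an empty-but-high-sequence stream
// this simply times out, yet it is this read path that triggered the spike.
view, _ := js.SubscribeSync("logs.>", nats.OrderedConsumer())
defer view.Unsubscribe()
if m, err := view.NextMsg(2 * time.Second); err == nil {
	meta, _ := m.Metadata()
	fmt.Println("first available seq:", meta.Sequence.Stream)
}
```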
Thanks so much for your patience. I found it, and it is fixed in the PR above. Once merged it will be part of the nightly builds as well.
Will be in nightly this evening. Was merged to both main and dev branches.
When using synadia/nats-server:nightly-20221107 (Git: [4e8d1ae2]) I haven't been able to cause memory or CPU spikes with nats stream view! As a side effect, nats stream view no longer clears out messages in the stream. But I guess why we get seemingly dangling messages on our streams is another topic altogether.
Are you still seeing dangling messages appearing after moving to the nightly builds?
Correct. I've created a new issue with more details: https://github.com/nats-io/nats-server/issues/3612
Thanks will take a look.
nats-server: 2.9.1
nats: 0.0.34
3 node cluster, AWS EC2 Ubuntu
Load before:
Load after nats stream view:
After a few seconds CPU returned to its previous level. After a few minutes memory too returned to its previous level.
If you're curious about the max memory limit: previously (with a previous nats-server version) these spikes exhausted the entire node's memory, causing a node outage. Setting this limit doesn't seem to have affected the normal workings of nats-server.