Closed by mtaobiz 2 months ago
Have any of the servers been restarted during this test?
Does it happen with the stream storage set to FileStorage instead of MemoryStorage?
Servers were not restarted. I didn't test with FileStorage. Is that important?
With FileStorage I continue to receive errors:
[293333] 2024/02/20 10:39:51.702404 [DBG] RAFT [qh7tjmNM - S-R3F-ErbH4zua] Installing snapshot of 216239 bytes
[293333] 2024/02/20 10:39:54.124782 [DBG] RAFT [qh7tjmNM - C-R3F-0J6szu52] Installing snapshot of 44 bytes
[293333] 2024/02/20 10:39:55.166339 [DBG] Apply stream entries for 'AD56GFOTE2TL6A7OYCOF6MW4INPOO44SJF3P3ZVTOJZLCV33J7ZP3UEQ > tickers' got error processing message: msgid is duplicate
[293333] 2024/02/20 10:39:57.178840 [DBG] Apply stream entries for 'AD56GFOTE2TL6A7OYCOF6MW4INPOO44SJF3P3ZVTOJZLCV33J7ZP3UEQ > tickers' got error processing message: msgid is duplicate
[293333] 2024/02/20 10:39:59.185842 [DBG] Apply stream entries for 'AD56GFOTE2TL6A7OYCOF6MW4INPOO44SJF3P3ZVTOJZLCV33J7ZP3UEQ > tickers' got error processing message: msgid is duplicate
And snapshot size is growing...
If you are exercising de-duplication via MsgId then that is OK. But it should not stop your consumer.
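For context, JetStream de-duplication is triggered by publishing with a Nats-Msg-Id header; a minimal sketch with the nats CLI (the subject and message ID here are made up for illustration, since the report does not show the publisher side):

```shell
# Publishing twice with the same Nats-Msg-Id inside the stream's duplicate
# window: the first message is stored, the second is dropped as a duplicate.
nats pub -H "Nats-Msg-Id: tick-001" tickers.BTCUSDT '{"price": 42000}'
nats pub -H "Nats-Msg-Id: tick-001" tickers.BTCUSDT '{"price": 42000}'
```

The "msgid is duplicate" debug lines in the log above are exactly what the stream emits when it rejects such a repeat within the duplicate window.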
What does nats str subjects tickers show?
Currently it shows one message per subject, and 2100 subjects. But at the moment the consumer works fine...
And you use MsgIds, yes?
And you have a max per subject of one, I see.
There were some improvements that may help in 2.10.11 if you want to upgrade.
Look at the consumer sequence and stream sequence numbers in the first message. Today the problem persists with another stream, which also has replicas=3. In this new case the consumer sequence is much greater than the stream's last sequence number too. In other streams, which function normally, the consumer sequence is not greater than the stream sequence. And again I can see a lot of "duplicate msgid" errors in the nats-server log. This new case was with nats version 2.10.10. Looks like a replication bug. I will test the new nats version later.
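The comparison described above can be scripted. A minimal sketch: the function only compares the two numbers; the comments show where they can be pulled from (field names as in the JetStream API responses that `nats ... info --json` prints):

```shell
# Flag a consumer whose tracked stream sequence has run ahead of the
# stream's last sequence, which should never happen on a healthy stream.
# The two values can be obtained from, e.g.:
#   nats stream info binanceSpotTickerQueue --json    -> .state.last_seq
#   nats consumer info binanceSpotTickerQueue ticker --json -> .delivered.stream_seq
check_seqs() {
  stream_last=$1
  consumer_stream=$2
  if [ "$consumer_stream" -gt "$stream_last" ]; then
    echo "consumer ahead"   # the broken state reported in this issue
  else
    echo "ok"
  fi
}

check_seqs 292 360839   # numbers from the report below -> "consumer ahead"
```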
New case. And now I have full details. Server version: 2.10.12
Information for Stream binanceSpotTickerQueue created 2024-03-27 15:32:09
Subjects: binance.ticker.spot.check
Replicas: 3
Storage: Memory
Placement Cluster: sandbox-nats-cluster
Options:
Retention: WorkQueue
Acknowledgments: true
Discard Policy: Old
Duplicate Window: 1.00s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: 1
Maximum Bytes: unlimited
Maximum Age: unlimited
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: sandbox-nats-cluster
Leader: nats-node-2
Replica: nats-node-1, current, seen 477ms ago
Replica: nats-node-3, current, seen 477ms ago
State:
Messages: 0
Bytes: 0 B
First Sequence: 360,545
Last Sequence: 360,544 @ 2024-05-07 18:10:01 UTC
Stream sequence is 360,544. Consumer:
Information for Consumer binanceSpotTickerQueue > ticker created 2024-03-27T15:32:11+01:00
Configuration:
Name: ticker
Description: Binance spot ticker update check consumer
Pull Mode: true
Deliver Policy: All
Ack Policy: Explicit
Ack Wait: 1m0s
Replay Policy: Instant
Maximum Deliveries: 1,000
Max Ack Pending: 4,000
Max Waiting Pulls: 512
Cluster Information:
Name: sandbox-nats-cluster
Leader: nats-node-3
Replica: nats-node-1, current, seen 465ms ago
Replica: nats-node-2, current, seen 465ms ago
State:
Last Delivered Message: Consumer sequence: 359,632 Stream sequence: 360,544 Last delivery: 478ms ago
Acknowledgment Floor: Consumer sequence: 359,632 Stream sequence: 360,544 Last Ack: 465ms ago
Outstanding Acks: 0 out of maximum 4,000
Redelivered Messages: 0
Unprocessed Messages: 0
Waiting Pulls: 1 of maximum 512
Consumer has stream sequence 360,544
10 min later:
Information for Stream binanceSpotTickerQueue created 2024-03-27 15:32:09
Subjects: binance.ticker.spot.check
Replicas: 3
Storage: Memory
Placement Cluster: sandbox-nats-cluster
Options:
Retention: WorkQueue
Acknowledgments: true
Discard Policy: Old
Duplicate Window: 1.00s
Allows Msg Delete: true
Allows Purge: true
Allows Rollups: false
Limits:
Maximum Messages: unlimited
Maximum Per Subject: 1
Maximum Bytes: unlimited
Maximum Age: unlimited
Maximum Message Size: unlimited
Maximum Consumers: unlimited
Cluster Information:
Name: sandbox-nats-cluster
Leader: nats-node-2
Replica: nats-node-1, current, seen 133ms ago
Replica: nats-node-3, current, seen 133ms ago
State:
Messages: 1
Bytes: 88 B
First Sequence: 292 @ 2024-05-07 18:20:01 UTC
Last Sequence: 292 @ 2024-05-07 18:20:01 UTC
Active Consumers: 1
Number of Subjects: 1
At this time the stream sequence was 292. Consumer:
Information for Consumer binanceSpotTickerQueue > ticker created 2024-03-27T15:32:11+01:00
Configuration:
Name: ticker
Description: Binance spot ticker update check consumer
Pull Mode: true
Deliver Policy: All
Ack Policy: Explicit
Ack Wait: 1m0s
Replay Policy: Instant
Maximum Deliveries: 1,000
Max Ack Pending: 4,000
Max Waiting Pulls: 512
Cluster Information:
Name: sandbox-nats-cluster
Leader: nats-node-2
Replica: nats-node-1, current, seen 804ms ago
Replica: nats-node-3, current, seen 804ms ago
State:
Last Delivered Message: Consumer sequence: 359,927 Stream sequence: 360,839 Last delivery: 5m5s ago
Acknowledgment Floor: Consumer sequence: 359,927 Stream sequence: 360,839 Last Ack: 5m5s ago
Outstanding Acks: 0 out of maximum 4,000
Redelivered Messages: 0
Unprocessed Messages: 0
But the consumer has stream sequence 360,839! After this the consumer stops receiving new messages.
Between these two points in time the cluster servers were restarted to the new version 2.10.14:
[836983] 2024/05/03 14:04:04.034960 [INF] Starting nats-server
[836983] 2024/05/03 14:04:04.034984 [INF] Version: 2.10.12
[1029273] 2024/05/07 18:15:10.122406 [INF] Starting nats-server
[1029273] 2024/05/07 18:15:10.122439 [INF] Version: 2.10.14
Log from one of the cluster servers about the broken stream:
[836983] 2024/05/07 18:14:13.513054 [INF] JetStream cluster new consumer leader for 'AD56GFOTE2TL6A7OYCOF6MW4INPOO44SJF3P3ZVTOJZLCV33J7ZP3UEQ > binanceSpotTickerQueue > ticker'
[836983] 2024/05/07 18:14:55.015563 [DBG] JetStream cluster, consumer 'AD56GFOTE2TL6A7OYCOF6MW4INPOO44SJF3P3ZVTOJZLCV33J7ZP3UEQ > binanceSpotTickerQueue > ticker' was already running
[836983] 2024/05/07 18:14:55.015574 [DBG] JetStream cluster, consumer 'AD56GFOTE2TL6A7OYCOF6MW4INPOO44SJF3P3ZVTOJZLCV33J7ZP3UEQ > binanceSpotTickerQueue > ticker' was already running
[836983] 2024/05/07 18:15:08.158077 [DBG] Exiting stream monitor for 'AD56GFOTE2TL6A7OYCOF6MW4INPOO44SJF3P3ZVTOJZLCV33J7ZP3UEQ > binanceSpotTickerQueue' [S-R3M-im3WIEsG]
[836983] 2024/05/07 18:15:08.195757 [DBG] Exiting consumer monitor for 'AD56GFOTE2TL6A7OYCOF6MW4INPOO44SJF3P3ZVTOJZLCV33J7ZP3UEQ > binanceSpotTickerQueue > ticker' [C-R3M-rrPUTr3c]
[1029273] 2024/05/07 18:15:10.148156 [DBG] Starting stream monitor for 'AD56GFOTE2TL6A7OYCOF6MW4INPOO44SJF3P3ZVTOJZLCV33J7ZP3UEQ > binanceSpotTickerQueue' [S-R3M-im3WIEsG]
[1029273] 2024/05/07 18:15:10.148160 [DBG] Waiting for consumers for interest based stream 'AD56GFOTE2TL6A7OYCOF6MW4INPOO44SJF3P3ZVTOJZLCV33J7ZP3UEQ > binanceSpotTickerQueue'
[1029273] 2024/05/07 18:15:10.148273 [DBG] Starting consumer monitor for 'AD56GFOTE2TL6A7OYCOF6MW4INPOO44SJF3P3ZVTOJZLCV33J7ZP3UEQ > binanceSpotTickerQueue > ticker' [C-R3M-rrPUTr3c]
# nats -s nats://sandbox-nats-v1:4222 --js-domain=cluster-sandbox-domain --creds=/etc/nats/creds/leaf.creds consumer next binanceSpotTickerQueue ticker
nats: error: no message received: nats: timeout
How do you update the servers? Do you check /healthz before moving on from one server to another?
We upgrade nodes one by one, and check each node's status in systemd after restart.
But do you check that nats-server's healthz returns 200/ok before moving to the next server?
No, we don't call healthz.
It would be a good idea to turn on monitoring and check /healthz before moving on to the next server.
But what if a node crashed and restarted? It's possible for two nodes to crash at the same time and restart simultaneously... I think the nats server must not perform any data manipulation before it has fully started and is healthy.
That is correct and that is what NATS does, but if you start a server and move on to the next before waiting for it to be healthy, you can allow the system to elect a leader that has not properly synced its state, since you are using memory storage.
Does using file storage resolve this issue?
We still run the same process with healthz checks for all our managed systems, but file-backed storage will allow a server to recover and sync quicker.
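The /healthz check discussed above can be folded into the rolling-upgrade procedure. A minimal sketch, assuming the monitoring endpoint is enabled on its default port 8222 (the node name in the usage comment is one from this report):

```shell
# Poll a nats-server monitoring /healthz endpoint until it reports healthy
# (HTTP 200), so the next node is only restarted once this one has caught up.
wait_healthy() {
  url=$1
  tries=${2:-60}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # /healthz returns 200 once the server is up and healthy
    if [ "$(curl -s -o /dev/null -w '%{http_code}' "$url")" = "200" ]; then
      echo "healthy"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# During a rolling upgrade, one node at a time:
#   systemctl restart nats-server
#   wait_healthy "http://nats-node-1:8222/healthz" || exit 1
```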
Fixed via #5179 as part of the v2.10.12 release, and memory stream rollouts were improved as part of #5506 in v2.10.17.
Observed behavior
cli:
nats-server log:
"msgid is duplicate" is not observed while the stream is not using replication.
Expected behavior
A replicated stream works fine, like a non-replicated one.
Server and client version
App with Go nats library
Host environment
Cluster: 3 nodes of Debian GNU/Linux 11 (bullseye), AMD Ryzen 7 7700 8-Core Processor (family: 0x19, model: 0x61, stepping: 0x2), 64 GB RAM
Client: Debian GNU/Linux 11 (bullseye), AMD Ryzen 9 3900 12-Core Processor, 128 GB RAM
Steps to reproduce
1 Create stream:
2 Create consumer:
3 Post new messages in one app and receive them via the consumer in another app within 24-48 hours.
4 Try to get new messages via the consumer:
nats -s nats://sandbox-nats-v1:4222 --js-domain=cluster-sandbox-domain --creds=/etc/nats/creds/leaf.creds consumer next tickers ticker
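The exact stream and consumer definitions for steps 1 and 2 are not shown above. A sketch of roughly equivalent `nats` CLI commands, reconstructed from the binanceSpotTickerQueue configuration printed earlier in the thread; the flag spellings and any value not stated in the report are assumptions:

```shell
# Step 1: a replicated, memory-backed work-queue stream with at most one
# message per subject and a 1s duplicate window (per the info output above).
nats stream add binanceSpotTickerQueue \
  --subjects binance.ticker.spot.check \
  --storage memory --replicas 3 \
  --retention work --discard old \
  --max-msgs-per-subject 1 --dupe-window 1s \
  --defaults

# Step 2: an explicit-ack pull consumer matching the reported configuration.
nats consumer add binanceSpotTickerQueue ticker \
  --pull --deliver all --ack explicit \
  --wait 1m --max-deliver 1000 \
  --max-pending 4000 --replay instant \
  --defaults
```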