
JetStream first sequence mismatch error keeps occurring for every restart/deployment #3516

Closed · MauriceVanVeen closed this 2 years ago

MauriceVanVeen commented 2 years ago

Defect


*Note: trace logging is not enabled, since reproducing this issue requires constant messaging (both adds and deletes), which spams the logged output. Debug logging is included.

Versions of nats-server and affected client libraries used:

OS/Container environment:

nats:
  image: nats:2.9.2-alpine
  logging:
    debug: true
  jetstream:
    enabled: true
    memStorage:
      enabled: true
      size: 5Gi
    fileStorage:
      enabled: true
      size: 30Gi
      storageClassName: gp3-retained
  resources:
    requests:
      cpu: 3
      memory: 6Gi
    limits:
      cpu: 4
      memory: 8Gi
  # Sets GOMEMLIMIT environment variable which makes the Go GC be aware of memory limits
  # from the container. Recommended to be set to about 90% of the resource memory limits.
  gomemlimit: 7GiB


*Note: where `gp3-retained` is a Kubernetes StorageClass using `gp3` disks and `Retain` reclaim policy.
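
As a side note on the GOMEMLIMIT comment in the values above: the Go runtime (1.19+) picks the soft memory limit up from the GOMEMLIMIT environment variable, and a small sketch like the one below (not part of the issue, just one way to verify the setting) can read back the value the runtime actually applied:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Passing a negative value does not change the limit; it returns
	// the currently effective soft memory limit, which the runtime
	// initializes from the GOMEMLIMIT environment variable.
	limit := debug.SetMemoryLimit(-1)
	fmt.Printf("effective memory limit: %d bytes (~%.2f GiB)\n",
		limit, float64(limit)/(1<<30))
}
```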

#### Steps or code to reproduce the issue:
- Set up a file-based R3 stream with a TTL (a 1-minute TTL in this example so there is no need to wait for a long time; in our real setup it's a 1-hour TTL). An equivalent Go-client sketch is included after these steps:

nats str add benchstream --storage=file --replicas=3 --retention=limits --max-age=1m --max-bytes=-1 --max-consumers=-1 --max-msg-size=-1 --max-msgs=-1 --max-msgs-per-subject=-1 --discard=old --dupe-window=1m --subjects=js.bench --allow-rollup --deny-delete --no-deny-purge

- Start benchmark:

nats bench js.bench --pub=1 --msgs=10000000 --pubsleep=0.00001s --size=1KB --pubbatch=1000

- The `benchstream` stream persists messages for 1 minute, after which it also starts removing messages. The benchmark should keep running for a long time. After at least a minute has passed, remove one instance (not sure whether it makes a difference if the instance is a leader or not; this was tested by restarting a non-leader instance).
- While the instance is restarting, the following messages are displayed:

[7] 2022/10/03 13:43:30.411232 [WRN] Error applying entries to '$G > benchstream': first sequence mismatch
[7] 2022/10/03 13:43:30.414851 [WRN] Resetting stream cluster state for '$G > benchstream'
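
For reference, the Go-client sketch mentioned in the first step: a hedged equivalent of the `nats str add benchstream ...` command using the nats.go JetStream API. The connection URL and error handling are placeholders; the CLI command above is the authoritative setup.

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder URL; use the real cluster address.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Mirrors the CLI flags above: file storage, 3 replicas, limits
	// retention, 1-minute max age, 1-minute duplicate window,
	// rollups allowed, deletes denied, purges allowed.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:              "benchstream",
		Subjects:          []string{"js.bench"},
		Storage:           nats.FileStorage,
		Replicas:          3,
		Retention:         nats.LimitsPolicy,
		MaxAge:            time.Minute,
		MaxBytes:          -1,
		MaxConsumers:      -1,
		MaxMsgs:           -1,
		MaxMsgsPerSubject: -1,
		MaxMsgSize:        -1,
		Discard:           nats.DiscardOld,
		Duplicates:        time.Minute,
		AllowRollup:       true,
		DenyDelete:        true,
		DenyPurge:         false,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```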



#### Expected result:
The restarted instance should only need to catch up with the messages that it missed during its downtime.

In this example a TTL of 1 minute is used; in our setup it's a 1-hour TTL, but with the same result.

#### Actual result:
The stream state is lost for this replica and needs to be reconstructed, which makes the restart/deployment take more time.

#### Files:
Before restart: [nats_logs_before.txt](https://github.com/nats-io/nats-server/files/9698092/nats_logs_before.txt)
After restart: [nats_logs_after.txt](https://github.com/nats-io/nats-server/files/9698093/nats_logs_after.txt)
kozlovic commented 2 years ago

I am trying to write a server test that would reproduce the issue. This error means that the snapshot the node receives has a first sequence that is lower than the first sequence in its store, so the node needs to clear its state. Suppose the leader has a snapshot that says the first sequence is 100 and the last is 1000, but the follower that receives this snapshot has a first sequence of 150; the follower then assumes that it is missing messages 100 to 149, and because the file store is a message log (append-only), it needs to wipe its state and sync with the leader. My suspicion is that, because of the message TTL, there is a race where this snapshot is actually no longer valid, in that the follower may have already discarded "old" messages. So far I have not been able to write a "simple" server test that demonstrates the issue. I will update when I am able to do so.
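
For illustration only, a minimal Go sketch of the decision described above (not the actual nats-server code; names are invented): the follower compares the snapshot's first sequence against its own, and because an append-only file store cannot back-fill older messages, a snapshot that starts before the local first sequence forces a reset.

```go
package main

import "fmt"

// snapshotState and storeState are hypothetical stand-ins for the
// leader's snapshot and the follower's local file store state.
type snapshotState struct{ firstSeq, lastSeq uint64 }
type storeState struct{ firstSeq, lastSeq uint64 }

// needsReset mirrors the reasoning above: if the snapshot's first
// sequence is lower than the follower's first sequence, the follower
// appears to be missing older messages that an append-only log can
// never insert, so it must wipe local state and resync with the leader.
func needsReset(snap snapshotState, local storeState) bool {
	return snap.firstSeq < local.firstSeq
}

func main() {
	// Leader snapshot covers 100..1000, but TTL-based expiry already
	// moved the follower's first sequence ahead to 150.
	snap := snapshotState{firstSeq: 100, lastSeq: 1000}
	local := storeState{firstSeq: 150, lastSeq: 1000}
	fmt.Println("reset required:", needsReset(snap, local)) // true
}
```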

kozlovic commented 2 years ago

Sorry, I posted a link to the wrong issue in the PR (https://github.com/nats-io/nats-server/issues/3519). Reopening this one.

MauriceVanVeen commented 2 years ago

I'm not sure if this is part of the same issue, but it's at least related due to the snapshots.

When repeating requests to the /healthz endpoint, it flip-flops quite often from status: ok to status: unavailable with some stream or consumer not being current. Could this be because of those snapshots as well, which would then also result in this first sequence mismatch upon restarting?
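
For context, a small Go sketch of how that flip-flopping can be observed by polling /healthz (assuming the default monitoring port 8222; adjust the URL for the actual deployment):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// Poll the monitoring endpoint once per second and print the
	// HTTP status plus the JSON body (e.g. {"status":"ok"}).
	for {
		resp, err := http.Get("http://localhost:8222/healthz")
		if err != nil {
			fmt.Println("request failed:", err)
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			fmt.Printf("%s %s\n", resp.Status, body)
		}
		time.Sleep(time.Second)
	}
}
```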

derekcollison commented 2 years ago

Have you run nats traffic as a system user to watch the stability of the system? If so, what did it report? Did you see lots of vote traffic?

MauriceVanVeen commented 2 years ago

@derekcollison not too sure how to read the tables, but haven't seen any Proposal, Vote or Remove Peer numbers.

Here is a copy after running nats traffic for a while:

╭──────────────────────────────────────────────────────────────────╮
│                           Raft Traffic                           │
├──────────┬──────┬────────┬─────────────┬────────┬────────────────┤
│ Proposal │ Vote │ Append │ Remove Peer │ Reply  │ Total Messages │
├──────────┼──────┼────────┼─────────────┼────────┼────────────────┤
│ 0        │ 0    │ 9,281  │ 0           │ 18,563 │ 27,844         │
│ 0        │ 0    │ 11,362 │ 0           │ 22,728 │ 34,090         │
│ 0        │ 0    │ 8,647  │ 0           │ 17,294 │ 25,941         │
│ 0        │ 0    │ 10,775 │ 0           │ 21,544 │ 32,319         │
│ 0        │ 0    │ 9,853  │ 0           │ 19,711 │ 29,564         │
│ 0        │ 0    │ 9,080  │ 0           │ 18,161 │ 27,241         │
│ 0        │ 0    │ 7,795  │ 0           │ 15,590 │ 23,385         │
│ 0        │ 0    │ 9,153  │ 0           │ 18,242 │ 27,395         │
│ 0        │ 0    │ 11,755 │ 0           │ 23,574 │ 35,329         │
│ 0        │ 0    │ 10,947 │ 0           │ 21,893 │ 32,840         │
│ 0        │ 0    │ 12,798 │ 0           │ 25,597 │ 38,395         │
╰──────────┴──────┴────────┴─────────────┴────────┴────────────────╯

╭─────────────────────────────────────────────────────────────────────────────────╮
│                                 Cluster Traffic                                 │
├────────────┬─────────────┬───────────────┬─────────────┬───────┬────────────────┤
│ JSA Update │ Stream Info │ Consumer Info │ Stream Sync │ Reply │ Total Messages │
├────────────┼─────────────┼───────────────┼─────────────┼───────┼────────────────┤
│ 23         │ 0           │ 0             │ 0           │ 0     │ 23             │
│ 21         │ 1           │ 0             │ 0           │ 0     │ 22             │
│ 24         │ 0           │ 0             │ 0           │ 0     │ 24             │
│ 19         │ 1           │ 0             │ 0           │ 0     │ 20             │
│ 24         │ 0           │ 0             │ 0           │ 0     │ 24             │
│ 21         │ 0           │ 0             │ 0           │ 0     │ 21             │
│ 20         │ 1           │ 0             │ 0           │ 0     │ 21             │
│ 22         │ 0           │ 0             │ 0           │ 0     │ 22             │
│ 24         │ 1           │ 0             │ 0           │ 0     │ 25             │
│ 22         │ 0           │ 0             │ 0           │ 0     │ 22             │
│ 22         │ 1           │ 0             │ 0           │ 0     │ 23             │
╰────────────┴─────────────┴───────────────┴─────────────┴───────┴────────────────╯

╭───────────────────────────────────────────────────────────────────────────╮
│                              General Traffic                              │
├──────────┬────────┬────────┬────────┬──────┬─────────────┬────────────────┤
│ Requests │ JS API │ JS ACK │ System │ Rest │ Total Bytes │ Total Messages │
├──────────┼────────┼────────┼────────┼──────┼─────────────┼────────────────┤
│ 0        │ 0      │ 0      │ 0      │ 0    │ 3.1 MiB     │ 0              │
│ 0        │ 1      │ 0      │ 9      │ 1    │ 3.5 MiB     │ 11             │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 2.4 MiB     │ 0              │
│ 0        │ 1      │ 0      │ 11     │ 1    │ 3.3 MiB     │ 13             │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 2.8 MiB     │ 0              │
│ 0        │ 0      │ 0      │ 2      │ 0    │ 2.6 MiB     │ 2              │
│ 0        │ 1      │ 0      │ 9      │ 1    │ 2.8 MiB     │ 11             │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 3.0 MiB     │ 0              │
│ 0        │ 1      │ 0      │ 9      │ 1    │ 3.5 MiB     │ 11             │
│ 0        │ 0      │ 0      │ 0      │ 0    │ 3.2 MiB     │ 0              │
│ 0        │ 1      │ 0      │ 9      │ 1    │ 4.1 MiB     │ 11             │
╰──────────┴────────┴────────┴────────┴──────┴─────────────┴────────────────╯
derekcollison commented 2 years ago

How many consumers do you have running?

MauriceVanVeen commented 2 years ago

4 consumers currently

derekcollison commented 2 years ago

A single stream? How many inbound messages to the streams? The append column looks high.

MauriceVanVeen commented 2 years ago

We currently have three streams, but only two of them see most of the traffic. (The third one, not mentioned here, only has about 10-50 messages/s, with the same settings as the other two regarding replication, file storage and TTL.)

A nats str report provides the following (adapted the stream names a bit to not expose the full name):

╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                     Consumer report for history with 1 consumers                                                │
├──────────────┬──────┬────────────┬──────────┬─────────────┬─────────────┬─────────────┬───────────────┬─────────────────────────┤
│ Consumer     │ Mode │ Ack Policy │ Ack Wait │ Ack Pending │ Redelivered │ Unprocessed │ Ack Floor     │ Cluster                 │
├──────────────┼──────┼────────────┼──────────┼─────────────┼─────────────┼─────────────┼───────────────┼─────────────────────────┤
│ rabbitmq     │ Push │ Explicit   │ 30.00s   │ 99          │ 0           │ 0           │ 2,053,581,370 │ nats-0, nats-1, nats-2* │
╰──────────────┴──────┴────────────┴──────────┴─────────────┴─────────────┴─────────────┴───────────────┴─────────────────────────╯

(this consumer reads messages from NATS, converts them to another format, and forwards them to RabbitMQ, which is what we used prior to moving to NATS)
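
A hedged sketch of such a push consumer with the Go client, matching the report above (durable "rabbitmq", explicit acks, 30s ack wait); the subject and connection URL are placeholders since the real values aren't shown here:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Durable push consumer with explicit acks and a 30s ack wait;
	// "history.>" is a placeholder subject for the stream above.
	_, err = js.Subscribe("history.>", func(m *nats.Msg) {
		// Convert and forward to RabbitMQ here (omitted), then ack.
		m.Ack()
	}, nats.Durable("rabbitmq"), nats.ManualAck(), nats.AckWait(30*time.Second))
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep the subscription alive
}
```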

derekcollison commented 2 years ago

ok that would explain the higher append numbers.

kozlovic commented 2 years ago

FYI: I was able to reproduce the "first sequence mismatch" (resulting in the cluster state reset) using manual tests. With debugging, I think I understand what is going on and will later try to write a unit test that reproduces it. I am not sure there is much we can do to prevent it, but we will discuss internally with @derekcollison to see if there is a way to mitigate it.

kozlovic commented 2 years ago

Ok, I have a server test case that reproduces the issue. I will discuss with @derekcollison next week to see what the next steps are. Thank you for your patience.

MauriceVanVeen commented 2 years ago

Awesome, thank you! :tada:

jocrau commented 2 years ago

I can confirm that we are seeing this issue as well. Thanks for reporting it @MauriceVanVeen and for fixing it @kozlovic.

kozlovic commented 2 years ago

@jocrau I did not fix it (I just wrote the server test that demonstrated the issue). @derekcollison is the one who fixed it :-)

MauriceVanVeen commented 1 year ago

@derekcollison @kozlovic started to deploy the new version 2.9.4, and although the first sequence mismatch doesn't occur anymore, another issue has taken its place and the instance is not able to restore itself anymore. It keeps retrying until it eventually exits the server and starts over (indefinitely).

It continuously logs these messages:

Catchup for stream 'X > Y' errored, will retry: wrong sequence for skipped msg
Catchup for stream 'X > Y' errored, will retry: expected sequence does not match store

Attached are the logs of this particular instance when deploying the new version. Will revert to version 2.9.3 for now. nats-2-logs.txt

derekcollison commented 1 year ago

We do not suggest reverting to an older version.

Have you tried backing up the stream and restoring? Or scaling down to 1 replica and then scaling back up?

Have you considered giving more resources to the nats-server? Does your monitoring show you how much memory the system is using at steady state and on startup?

derekcollison commented 1 year ago

Could you share a tree output of the storage directories with us privately? Would it be possible to get access to all the storage directories to possibly try to recreate here?

MauriceVanVeen commented 1 year ago

Have shared the directories privately with Derek. Did manage to get those 3 instances to run 2.9.4 now.

MauriceVanVeen commented 1 year ago

Since upgrading to 2.9.5 it all started working perfectly, thanks again! (we've upgraded to 2.9.6 already, still working like a charm)