rubencosta opened 1 week ago
Is this easily reproducible?
@derekcollison the issue is two-fold:

1. the stream state files becoming corrupted/inconsistent in the first place;
2. NATS not detecting or recovering from the resulting inconsistency.

You can reproduce 2 from the repro repo. Triggering the actual state file corruption is harder, but I suspect it happens under a lot of writes and a non-clean exit. https://github.com/nats-io/nats-server/discussions/5308 seems to be related.
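I have not managed to trigger the corruption on demand, but this is roughly what I would try (untested sketch; the subject and message count are placeholders): sustained writes followed by a hard kill of all servers, mimicking the power pull.

```sh
# Untested sketch: keep writing into the stream's subjects, then kill the
# servers without a clean shutdown (similar to pulling the power cables).
nats pub 'repro.writes' 'payload {{Count}}' --count 1000000 &
sleep 10
docker-compose kill nats-0 nats-1 nats-2
docker-compose up -d
```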
Observed behavior
As part of testing system failure and recovery on a local 3-blade-server setup running a NATS cluster with R3 streams in k8s, we have somehow ended up in an unexpectedly inconsistent state that NATS seems incapable of recovering from.
We first observed the issue when consuming messages from an R3 stream: different messages were returned depending on which node the consumer was created on. We're unsure of what exactly caused it, but it was somehow triggered by intentionally pulling the power cables from all servers and booting them back up.
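For reference, this is roughly how the per-node difference can be observed (a sketch, not the exact commands from our setup): force stream leadership onto each node in turn, as in the steps further down, and compare the state each leader reports.

```sh
# Sketch: repeat the step-down until the node you want to inspect is the
# leader, then dump the stream state it reports and compare across nodes.
nats stream cluster step-down system-config
nats stream info system-config --json | jq .state
```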
Possible assumptions / Wild Guesses
- NATS seems to only check for a match on `Last Sequence` and fails to recover Streams that have a mismatch on `First Sequence`. There's a `var errFirstSequenceMismatch = errors.New("first sequence mismatch")` in the nats-server code base but no place where it could be returned (a quick search of the source, shown below, only finds the declaration). This could indicate that at some point there was a check for it and that maybe it should be put back in place.
- The nodes do agree on `Last Sequence`, but from what I understand of NATS's RAFT implementation, it does not follow the Leader Append-Only property because it actually removes old messages from a Stream. I would think that removing messages would not be related to the RAFT log at all, but I have reason to believe that's not the case according to the NATS JetStream Clustering docs.

I'm happy to run more tests and provide whatever extra info is needed. After trying to look into this myself, we thought it would make sense to have the actual NATS maintainers shine some light on this instead of doing further guesswork.
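For reference, a quick search of the server source is enough to confirm the unused error (sketch; the clone location is arbitrary):

```sh
# The error variable is declared in the server source but, as far as I can
# tell, never returned anywhere.
git clone https://github.com/nats-io/nats-server
grep -rn "errFirstSequenceMismatch" nats-server/server/
```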
Expected behavior
NATS would be able to recover a consistent state across all replicas
Server and client version
Observed in
Host environment
The NATS server is configured with 10GB of file storage and no in-memory storage. The underlying volumes are provisioned with OpenEBS LocalPV on top of a multi-disk ZFS pool of SSDs.
Steps to reproduce
I have tried reproducing it locally without any success and have not tried reproducing it again in our test cluster, so as not to lose the state we have ended up in. I'm fairly convinced we have run into the same issue on a production cluster before, but there were too many variables to pinpoint what caused it at the time.
Here's a repo containing the minimal state and all the configurations I used that match our cluster setup. It's not a reproduction of what caused the root issue but rather a snapshot of the broken state: I dumped all the state files from the server and re-created the cluster locally using docker-compose after removing all the other streams and consumers. After restarting the nodes multiple times I could still observe the state drift and no error logs, which indicates to me that NATS is unaware of it.
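Roughly the loop I used to check that the drift survives restarts with nothing in the logs (sketch; the grep patterns are just what I looked for):

```sh
# Bring the snapshot up, restart all nodes, and check state plus logs.
docker-compose up -d
nats stream info system-config --json | jq .state
docker-compose restart nats-0 nats-1 nats-2
docker-compose logs --tail=500 | grep -iE 'error|corrupt|mismatch' || echo "no errors logged"
nats stream info system-config --json | jq .state   # drift still present
```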
Further investigation
Having no helpful logs and not enough knowledge of how to debug this kind of issue using the CLI, I decided to look at the files that make up the JetStream storage. I focused on a stream named `system-config`. After finding it under the correct account directory there were 2 files: `<index>.blk` and `index.db`. `nats-0` and `nats-1` contained a `258.blk` while `nats-2` contained a `1.blk`. There was an obvious correlation between that and the messages I was getting from the stream - I always got the same results from `nats-0` and `nats-1`, which differed from `nats-2`. The stream data in `nats-2` looked correct, with 7 messages matching the number of subjects (max of 1 message per subject), whereas the other 2 nodes only contained the last 3 messages of the stream. So basically there's a match in `Last Sequence` but the number of messages, subjects and `First Sequence` don't match.

Stream info from node-0

Stream info from node-1

Stream info from node-2
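For completeness, the per-node block files can be listed straight from the repro repo checkout (sketch; paths follow the snapshot's directory layout):

```sh
# List the message block files and index for the system-config stream in
# each node's data directory.
for node in nats-0 nats-1 nats-2; do
  echo "== $node =="
  find "data/$node/jetstream/USERS/streams/system-config" \
    -name '*.blk' -o -name 'index.db'
done
```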
Here are a couple of observations I made by intentionally manipulating the stream state data.
If we run the following in the repro repo:

1. `docker-compose up -d` to start the cluster
2. `nats stream cluster step-down system-config` until you get `nats-2` as the leader
3. `docker-compose stop nats-1`
4. `sudo rm -rf data/nats-1/jetstream/SYS/_js_/* data/nats-1/jetstream/USERS/streams/system-config`
5. `docker-compose up nats-1`

the state is correctly restored from the leader (`nats-2`).

But if instead we only remove the stream data, leaving the Meta stream data in place, with `sudo rm -rf data/nats-1/jetstream/USERS/streams/system-config`, then `nats-1` comes back with a stream containing no data at all while reporting it as current.
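One way to see that last state (sketch): step leadership onto `nats-1`, as above, and compare the state it reports with the replica flags in the cluster info.

```sh
# Repeat the step-down until nats-1 is the leader, then look at what it
# reports: no messages, yet the replicas are all listed as current.
nats stream cluster step-down system-config
nats stream info system-config --json | jq '{state: .state, replicas: .cluster.replicas}'
```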