rubencosta opened 1 week ago
Is this easily reproducible?
@derekcollison the issue is two-fold:

1. the stream state files becoming corrupted/inconsistent in the first place;
2. NATS not detecting or recovering from the resulting inconsistency.

You can reproduce 2 from the repro repo. Triggering the actual state file corruption is harder, but I suspect it happens under a lot of writes and a non-clean exit. https://github.com/nats-io/nats-server/discussions/5308 seems to be related.
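I have not managed to trigger the corruption on demand, but this is roughly what I would try (untested sketch; the subject and message count are placeholders): sustained writes followed by a hard kill of all servers, mimicking the power pull.

```sh
# Untested sketch: keep writing into the stream's subjects, then kill the
# servers without a clean shutdown (similar to pulling the power cables).
nats pub 'repro.writes' 'payload {{Count}}' --count 1000000 &
sleep 10
docker-compose kill nats-0 nats-1 nats-2
docker-compose up -d
```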
Observed behavior
As part of testing system failure and recovery on a local 3-blade-server setup running a NATS cluster with R3 streams in k8s, we have somehow ended up in an unexpectedly inconsistent state that NATS seems incapable of recovering from.
We first observed the issue when consuming messages from an R3 stream: different messages were returned depending on which node the consumer was created on. We're unsure of what exactly caused it, but it was somehow triggered by intentionally pulling the power cables from all servers and booting them back up.
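For reference, this is roughly how the per-node difference can be observed (a sketch, not the exact commands from our setup): force stream leadership onto each node in turn, as in the steps further down, and compare the state each leader reports.

```sh
# Sketch: repeat the step-down until the node you want to inspect is the
# leader, then dump the stream state it reports and compare across nodes.
nats stream cluster step-down system-config
nats stream info system-config --json | jq .state
```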
Possible assumptions / Wild Guesses
- NATS seems to only check for a match on `Last Sequence` and fails to recover Streams that have a mismatch on `First Sequence`. There's a `var errFirstSequenceMismatch = errors.New("first sequence mismatch")` in the nats-server code base but no place where it could be returned (a quick search of the source, shown below, only finds the declaration). This could indicate that at some point there was a check for it and that maybe it should be put back in place.
- The nodes do agree on `Last Sequence`, but from what I understand of NATS's RAFT implementation, it does not follow the Leader Append-Only property because it actually removes old messages from a Stream. I would think that removing messages would not be related to the RAFT log at all, but I have reason to believe that's not the case according to the NATS JetStream Clustering docs.

I'm happy to run more tests and provide whatever extra info is needed. After trying to look into this myself, we thought it would make sense to have the actual NATS maintainers shine some light on this instead of doing further guesswork.
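For reference, a quick search of the server source is enough to confirm the unused error (sketch; the clone location is arbitrary):

```sh
# The error variable is declared in the server source but, as far as I can
# tell, never returned anywhere.
git clone https://github.com/nats-io/nats-server
grep -rn "errFirstSequenceMismatch" nats-server/server/
```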
Expected behavior
NATS would be able to recover a consistent state across all replicas
Server and client version
Observed in
Host environment
The NATS server is configured with 10GB of file storage and no in-memory storage. The underlying volumes are provisioned with OpenEBS LocalPV on top of a multi-disk ZFS pool of SSDs.
Steps to reproduce
I have tried reproducing it locally without any success and have not tried reproducing it again in our test cluster, so as not to lose the state we have ended up in. I'm fairly convinced we have run into the same issue on a production cluster before, but there were too many variables to pinpoint what caused it at the time.
Here's a repo containing the minimal state and all the configurations I used that match our cluster setup. It's not a reproduction of what caused the root issue but rather a snapshot of the broken state: I dumped all the state files from the server and re-created the cluster locally using docker-compose after removing all the other streams and consumers. After restarting the nodes multiple times I could still observe the state drift and no error logs, which indicates to me that NATS is unaware of it.
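Roughly the loop I used to check that the drift survives restarts with nothing in the logs (sketch; the grep patterns are just what I looked for):

```sh
# Bring the snapshot up, restart all nodes, and check state plus logs.
docker-compose up -d
nats stream info system-config --json | jq .state
docker-compose restart nats-0 nats-1 nats-2
docker-compose logs --tail=500 | grep -iE 'error|corrupt|mismatch' || echo "no errors logged"
nats stream info system-config --json | jq .state   # drift still present
```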
Further investigation
Having no helpful logs and not enough knowledge of how to debug this kind of issue using the CLI, I decided to look at the files that make up the JetStream storage. I focused on a stream named `system-config`. After finding it under the correct account directory there were 2 files: `<index>.blk` and `index.db`. `nats-0` and `nats-1` contained a `258.blk` while `nats-2` contained a `1.blk`. There was an obvious correlation between that and the messages I was getting from the stream - I always got the same results from `nats-0` and `nats-1`, which differed from `nats-2`. The stream data in `nats-2` looked correct, with 7 messages matching the number of subjects (max of 1 message per subject), whereas the other 2 nodes only contained the last 3 messages of the stream. So basically there's a match in `Last Sequence` but the number of messages, subjects and `First Sequence` don't match.

Stream info from node-0

Stream info from node-1

Stream info from node-2
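For completeness, the per-node block files can be listed straight from the repro repo checkout (sketch; paths follow the snapshot's directory layout):

```sh
# List the message block files and index for the system-config stream in
# each node's data directory.
for node in nats-0 nats-1 nats-2; do
  echo "== $node =="
  find "data/$node/jetstream/USERS/streams/system-config" \
    -name '*.blk' -o -name 'index.db'
done
```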
Here are a couple of observations I made by intentionally manipulating the stream state data.
If we run the following in the repro repo:

1. `docker-compose up -d` to start the cluster
2. `nats stream cluster step-down system-config` until you get `nats-2` as the leader
3. `docker-compose stop nats-1`
4. `sudo rm -rf data/nats-1/jetstream/SYS/_js_/* data/nats-1/jetstream/USERS/streams/system-config`
5. `docker-compose up nats-1`

the state is correctly restored from the leader (`nats-2`).

But if instead we only remove the stream data, leaving the Meta stream data in place, with `sudo rm -rf data/nats-1/jetstream/USERS/streams/system-config`, then `nats-1` comes back with a stream containing no data at all while reporting it as current.
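One way to see that last state (sketch): step leadership onto `nats-1`, as above, and compare the state it reports with the replica flags in the cluster info.

```sh
# Repeat the step-down until nats-1 is the leader, then look at what it
# reports: no messages, yet the replicas are all listed as current.
nats stream cluster step-down system-config
nats stream info system-config --json | jq '{state: .state, replicas: .cluster.replicas}'
```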