
JetStream Stream drift and failure to recover consistency observed in a R3 setup backed by file storage #5576

Open rubencosta opened 1 week ago

rubencosta commented 1 week ago

Observed behavior

As part of testing system failure and recovery on a local 3-blade server cluster running NATS R3 in k8s, we somehow ended up in an unexpectedly inconsistent state that NATS seems incapable of recovering from.

We first observed the issue when consuming messages from an R3 Stream: different messages were returned depending on which node the Consumer was created on. We're not sure exactly what caused it, but it was somehow triggered by intentionally pulling the power cables from all servers and booting them back up.
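
One way to capture each replica's view of the stream (STREAM.INFO is answered by the current stream leader, so simply connecting to a different node isn't enough) is to move leadership around and snapshot the state each time. A rough sketch with the nats CLI; the loop count and sleep are arbitrary, and step-down elects the next leader at random, so it may take a few rounds to hit every node:

```sh
# Snapshot the leader's view, then ask it to step down so another
# replica takes over, and repeat until every node has been leader.
for i in 1 2 3; do
  nats stream info system-config --json |
    jq '{leader: .cluster.leader, messages: .state.messages,
         first_seq: .state.first_seq, last_seq: .state.last_seq}'
  nats stream cluster step-down system-config
  sleep 2
done
```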

Possible assumptions / Wild Guesses

I'm happy to run more tests and provide whatever extra info is needed. After trying to look into this myself, we thought it would make sense to have the actual NATS maintainers shine some light on this instead of doing further guesswork.

Expected behavior

NATS should be able to recover a consistent state across all replicas.

Server and client version

Observed in

Host environment

The NATS server is configured with 10GB of file storage and no in-memory storage. The underlying volumes are provisioned with OpenEBS LocalPV on top of a multi-disk ZFS pool of SSD storage.
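
For reference, a minimal sketch of the JetStream block of a server config matching that description (the store path and exact syntax are assumptions, not copied from our cluster):

```
jetstream {
  store_dir: /data/jetstream   # backed by the OpenEBS LocalPV volume
  max_file_store: 10GB
  max_memory_store: 0          # assumed: explicit 0 disables memory storage
}
```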

Steps to reproduce

I have tried reproducing it locally without success, and have not tried reproducing it again in our test cluster so as not to lose the state we have ended up in. I'm fairly convinced we have run into the same issue on a production cluster before, but there were too many variables to pinpoint what caused it at the time.

Here's a repo containing the minimal state and all configurations I used that match our cluster setup. It's not a reproduction of what caused the root issue but more of a snapshot of the broken state. I dumped all the state files from the server and re-created the cluster locally using docker-compose, after removing all the other streams and consumers. After restarting nodes multiple times I could still observe the state drift and no error logs, which indicates to me that NATS is unaware of it.

Further investigation

Having no helpful logs and not enough knowledge of how to debug these kinds of issues with the CLI, I decided to look at the files that make up the JetStream storage, focusing on a stream named system-config. After finding it under the correct account directory, there were 2 files: an <index>.blk and an index.db. nats-0 and nats-1 contained a 258.blk, while nats-2 contained a 1.blk. There was an obvious correlation between that and the messages I was getting from the stream: I always got the same results from nats-0 and nats-1, which differed from nats-2.

The stream data in nats-2 looked correct, with 7 messages matching the number of subjects at a max of 1 message per subject, whereas the other 2 nodes only contained the last 3 messages of the stream. So the Last Sequence matches across nodes, but the message count, number of subjects, and First Sequence do not.
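
The divergence is visible with a plain directory listing across the nodes; in the repro repo's layout that's something like the following (the block files typically live in a msgs subdirectory of the stream directory; adjust the path if they sit directly in it):

```sh
# Compare the message block files for the stream across all nodes.
for node in nats-0 nats-1 nats-2; do
  echo "--- $node ---"
  ls -l "data/$node/jetstream/USERS/streams/system-config/msgs/"
done
```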

Stream info from nats-0

Information for Stream system-config created 2024-06-13 14:49:29

              Subjects: config.system.>
              Replicas: 3
               Storage: File

Options:

             Retention: Limits
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 2m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: 1
         Maximum Bytes: unlimited
           Maximum Age: unlimited
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats
                Leader: nats-0
               Replica: nats-1, current, seen 489ms ago
               Replica: nats-2, current, seen 490ms ago

State:

              Messages: 3
                 Bytes: 217 B
        First Sequence: 1,027 @ 2024-06-19 16:14:50 UTC
         Last Sequence: 1,029 @ 2024-06-19 17:45:37 UTC
      Active Consumers: 0
    Number of Subjects: 3

Stream info from nats-1

Information for Stream system-config created 2024-06-13 14:49:29

              Subjects: config.system.>
              Replicas: 3
               Storage: File

Options:

             Retention: Limits
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 2m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: 1
         Maximum Bytes: unlimited
           Maximum Age: unlimited
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats
                Leader: nats-1
               Replica: nats-0, current, seen 491ms ago
               Replica: nats-2, current, seen 490ms ago

State:

              Messages: 3
                 Bytes: 217 B
        First Sequence: 1,027 @ 2024-06-19 16:14:50 UTC
         Last Sequence: 1,029 @ 2024-06-19 17:45:37 UTC
      Active Consumers: 0
    Number of Subjects: 3

Stream info from nats-2

Information for Stream system-config created 2024-06-13 14:49:29

              Subjects: config.system.>
              Replicas: 3
               Storage: File

Options:

             Retention: Limits
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 2m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: 1
         Maximum Bytes: unlimited
           Maximum Age: unlimited
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats
                Leader: nats-2
               Replica: nats-0, current, seen 461ms ago
               Replica: nats-1, current, seen 461ms ago

State:

              Messages: 7
                 Bytes: 1.1 KiB
        First Sequence: 778 @ 2024-06-19 17:19:31 UTC
         Last Sequence: 1,029 @ 2024-06-19 17:45:37 UTC
      Deleted Messages: 245
      Active Consumers: 0
    Number of Subjects: 7

Here are a couple of observations I have made by intentionally manipulating the stream state data.

If we remove all of nats-1's JetStream data in the repro repo and restart it, the state is correctly restored from the leader (nats-2).

But if instead we only remove the stream data, leaving the JetStream metadata in place, with sudo rm -r data/nats-1/jetstream/USERS/streams/system-config, then nats-1 comes back with a stream containing no data at all while reporting it as current.
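
Sketched as commands, with the docker-compose service names assumed from the repro repo:

```sh
# Stop the node, drop only its stream data (keeping the JetStream
# metadata), then bring it back and inspect what it reports.
docker-compose stop nats-1
sudo rm -r data/nats-1/jetstream/USERS/streams/system-config
docker-compose start nats-1
nats stream info system-config
```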

Information for Stream system-config created 2024-06-13 14:49:29

              Subjects: config.system.>
              Replicas: 3
               Storage: File

Options:

             Retention: Limits
       Acknowledgments: true
        Discard Policy: Old
      Duplicate Window: 2m0s
     Allows Msg Delete: true
          Allows Purge: true
        Allows Rollups: false

Limits:

      Maximum Messages: unlimited
   Maximum Per Subject: 1
         Maximum Bytes: unlimited
           Maximum Age: unlimited
  Maximum Message Size: unlimited
     Maximum Consumers: unlimited

Cluster Information:

                  Name: nats
                Leader: nats-1
               Replica: nats-0, current, seen 491ms ago
               Replica: nats-2, current, seen 491ms ago

State:

              Messages: 0
                 Bytes: 0 B
        First Sequence: 0
         Last Sequence: 0
      Active Consumers: 0
derekcollison commented 6 days ago

Is this easily reproducible?

rubencosta commented 6 days ago

@derekcollison the issue is two-fold:

  1. the state files become inconsistent across replicas
  2. NATS can't recover from it

You can reproduce point 2 from the repro repo.

Triggering the actual state file corruption is harder, but I suspect it happens under a lot of writes combined with a non-clean exit. https://github.com/nats-io/nats-server/discussions/5308 seems related.
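
A rough way to exercise that theory against a throwaway cluster would be to drive sustained JetStream writes and SIGKILL a node mid-traffic (the nats bench flags and service name are assumptions, and a kill only approximates a real power pull):

```sh
# Sustained JetStream publishes into a scratch bench stream...
nats bench bench.load --js --pub 4 --msgs 1000000 &
sleep 5
# ...then take a node down hard (SIGKILL, no clean shutdown) and
# bring it back to check the replicas for drift.
docker-compose kill nats-2
docker-compose start nats-2
```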