nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

During rolling update in Kubernetes data in KV store is lost in some nodes #4351

Closed · yoadey closed this 1 month ago

yoadey commented 1 year ago

We are having a problem with our NATS cluster in a Kubernetes cluster, where the entries in KV stores differ between the NATS nodes after server restarts, leaving us with data loss in some cases. As the Kubernetes cluster gets updated every week, this issue has now been seen several times, although we cannot reproduce it in a stable manner. Our current workaround is that if we don't get the data on the first attempt, we repeat the same request up to 5 times; due to the server-side request routing we then normally get the wanted data.
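
A minimal sketch of that retry workaround (not our exact production code; the bucket name and timings are illustrative):

package main

import (
    "log"
    "os"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    url := os.Getenv("NATS_URL")
    if url == "" {
        url = nats.DefaultURL
    }
    nc, err := nats.Connect(url)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }
    kv, err := js.KeyValue("my-bucket") // hypothetical bucket name
    if err != nil {
        log.Fatal(err)
    }

    // After a restart, different servers may answer with different state,
    // so ask up to 5 times before giving up.
    var keys []string
    for attempt := 1; attempt <= 5; attempt++ {
        keys, err = kv.Keys()
        if err == nil && len(keys) > 0 {
            break
        }
        time.Sleep(200 * time.Millisecond)
    }
    log.Printf("got %d keys (last err: %v)", len(keys), err)
}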

Defect

Versions of nats-server and affected client libraries used:

[68] 2023/07/31 10:13:00.901015 [INF] Starting nats-server
[68] 2023/07/31 10:13:00.901069 [INF]   Version:  2.9.20
[68] 2023/07/31 10:13:00.901073 [INF]   Git:      [not set]
[68] 2023/07/31 10:13:00.901079 [DBG]   Go build: go1.20.6
[68] 2023/07/31 10:13:00.901091 [INF]   Name:     ND2HTZ5KWVBYGEJPIJBGHBVKSPXTUSWULKAOKXKSFLQNW23C6RYO7DTC
[68] 2023/07/31 10:13:00.901100 [INF]   ID:       ND2HTZ5KWVBYGEJPIJBGHBVKSPXTUSWULKAOKXKSFLQNW23C6RYO7DTC
[68] 2023/07/31 10:13:00.901175 [DBG] Created system account: "$SYS"

(We used a custom build based on 2.9.20 that includes the fix from https://github.com/nats-io/nats-server/issues/4320. The problem also occurred with the official 2.9.19 nats-server container.)

All client libraries are affected.

OS/Container environment:

Kubernetes containerd

Steps or code to reproduce the issue:

Have the following cluster setup:

Expected result:

Requesting all entries of the same KV store multiple times should always return the same list.

Actual result:

Sometimes after restarts the result differs between the nodes and stays different.

Additional details

This does not occur after every restart; it also seems to depend a bit on how long the node was down, and for some KVs it seems to heal itself after some time (sometimes minutes, sometimes hours, sometimes days).

Go code for testscript:

The following script lists the entries of each KV store multiple times and prints the result. As the NATS server routes the requests itself, we don't need to switch servers between iterations.

package main

import (
    "log"
    "os"
    "strings"

    "github.com/nats-io/nats.go"
)

// connect opens a NATS connection and returns a JetStream context.
func connect() nats.JetStreamContext {
    url := os.Getenv("NATS_URL")
    if url == "" {
        url = nats.DefaultURL
    }

    nc, err := nats.Connect(url)
    if err != nil {
        log.Fatal(err)
    }
    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("connected %s", url)
    return js
}

// kvNames returns the names of all KV buckets, stripping the "KV_" prefix
// from the backing stream names reported by KeyValueStoreNames().
func kvNames() []string {

    js := connect()

    list := make([]string, 0, 1024)
    for r := range js.KeyValueStoreNames() {
        list = append(list, r[3:]) // "KV_<bucket>" -> "<bucket>"
    }
    return list
}

func main() {
    log.Print("natscheck ...")

    list := kvNames()

    js := connect()
    for i, kvname := range list {
        log.Printf("%d/%d check buckets:%s", i, len(list), kvname)
        var lastList []string
        var entries int = -1
        for loop := 0; loop < 10; loop++ {

            kv, err := js.KeyValue(kvname)
            if err != nil {
                log.Fatalf("[%s]:%v", kvname, err)
            }
            currList, err := kv.Keys()
            // Keys() returns an error when the bucket currently has no keys;
            // treat that case as an empty list instead of failing.
            if err != nil && !strings.Contains(err.Error(), "no keys found") {
                log.Fatal(err)
            }
            //kv.History()
            // Compare against the previous iteration's result (skip the first pass).
            if loop > 0 {
                diff1 := difference(currList, lastList)
                if len(diff1) > 0 {
                    log.Printf("Found %d/%d different keys in current, but not last list: %s", len(diff1), len(currList), strings.Join(diff1, ", "))
                    entries = -1
                }
                diff2 := difference(lastList, currList)
                if len(diff2) > 0 {
                    log.Printf("Found %d/%d different keys in last, but not current list: %s", len(diff2), len(lastList), strings.Join(diff2, ", "))
                    entries = -1
                }
            } else {
                entries = len(currList)
            }
            lastList = currList
        }
        if entries > 0 {
            log.Printf("Found %d same entries", entries)
        }
    }
}

// difference returns the elements of a that are not present in b.
func difference(a, b []string) []string {
    mb := make(map[string]struct{}, len(b))
    for _, x := range b {
        mb[x] = struct{}{}
    }
    var diff []string
    for _, x := range a {
        if _, found := mb[x]; !found {
            diff = append(diff, x)
        }
    }
    return diff
}
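
To compare nodes directly, the same script can be pointed at a single pod by setting NATS_URL (the pod DNS name here is hypothetical), e.g. NATS_URL=nats://nats-0.nats.default.svc.cluster.local:4222 go run .
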
derekcollison commented 1 year ago

When doing restarts, do you wait for /healthz to return ok before moving to the next server?
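
For illustration, a minimal sketch of such a gate against the monitoring endpoint, assuming the default monitoring port 8222 is reachable from wherever the rollout runs (the host name and timeout are illustrative):

package main

import (
    "fmt"
    "log"
    "net/http"
    "time"
)

// waitHealthy polls the server's /healthz monitoring endpoint until it
// returns HTTP 200 or the timeout expires.
func waitHealthy(host string, timeout time.Duration) error {
    url := fmt.Sprintf("http://%s:8222/healthz", host)
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        resp, err := http.Get(url)
        if err == nil {
            resp.Body.Close()
            if resp.StatusCode == http.StatusOK {
                return nil // server reports healthy
            }
        }
        time.Sleep(2 * time.Second)
    }
    return fmt.Errorf("server %s not healthy after %s", host, timeout)
}

func main() {
    // Hypothetical pod address; wait before restarting the next server.
    if err := waitHealthy("nats-0.nats.default.svc", 5*time.Minute); err != nil {
        log.Fatal(err)
    }
    log.Println("healthy, safe to restart the next server")
}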

derekcollison commented 1 year ago

It would also be good to make sure we can reproduce this from the top of main, i.e. the 2.9.21 release candidate.

yoadey commented 1 year ago

The readiness check waits for /healthz to return ok, and before that we don't route any traffic to that NATS server. However, I have the impression that the new NATS server registers itself with the other NATS servers before its /healthz returns ok, and those ready NATS nodes already route requests to it. But that's just an impression from the CPU and memory stats and some logs I had; I can't say it reliably.

derekcollison commented 1 year ago

For direct GETs, which all stream peers can participate in: yes, once they think they are caught up they can participate.
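
As an aside, one possible mitigation (an assumption on the reader's part, not an official recommendation) would be to disable direct gets on the bucket's backing stream so reads always go through the stream leader, trading read scalability for consistency. A sketch, with a hypothetical bucket name:

package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // KV buckets are backed by a stream named "KV_<bucket>".
    info, err := js.StreamInfo("KV_my-bucket") // hypothetical bucket name
    if err != nil {
        log.Fatal(err)
    }

    cfg := info.Config
    cfg.AllowDirect = false // route gets through the stream leader only
    if _, err := js.UpdateStream(&cfg); err != nil {
        log.Fatal(err)
    }
    log.Println("direct gets disabled for KV_my-bucket")
}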

wallyqs commented 1 year ago

One thing that you could try too is to bump the initialDelaySeconds to be longer during your rollouts:

    liveness:
      enabled: true
      initialDelaySeconds: 100

https://github.com/nats-io/k8s/blob/f87652b54cbcbee6c32f7a3392982f4f690f0dc2/helm/charts/nats/values.yaml#L51C7-L51C26

yoadey commented 1 year ago

Will try the initialDelaySeconds change with the next restart next week; we currently want to restart as few times as possible, as only the production system currently has the load where we can monitor this with good data. We are trying to set up another test instance where we can reproduce this behavior, but until that is finished we have to wait.

yoadey commented 1 year ago

Just tested with liveness.initialDelaySeconds set to 120, but it didn't help; the same goes for readiness.initialDelaySeconds. One thing to note: the state before the test was good, meaning we didn't have any issues, as the NATS cluster seemed to have healed itself this morning when we did a rolling update of the application that uses NATS.

yoadey commented 1 year ago

I've tried to reproduce the behavior with a small Go client, but didn't succeed. We can reproduce the behavior with our full Java application in a dev environment, but haven't yet been able to isolate the root cause.

We have now seen it multiple times that the NATS cluster synchronized itself after our application was restarted. Is there anything (open connections, special commands) which can prevent the cluster from synchronizing, especially in the Java library?

yoadey commented 1 year ago

I've seen that in the Helm chart the service has publishNotReadyAddresses: true set. Afterwards I found your blog post about the new Helm chart release, which explicitly mentions this flag: https://nats.io/blog/nats-helm-chart-1.0.0-rc/

Could this be the reason for this behavior? Since we didn't create an additional service, our client probably connected to a not-yet-ready NATS pod through this service, which would also render readiness.initialDelaySeconds useless.

We'll try to upgrade to the latest chart, but this will take some time as we need to thoroughly test the changes before going live.

derekcollison commented 1 year ago

Keep us posted, you should not be losing any messages of course.

mpwalkerdine commented 8 months ago

We appear to have come across a very similar situation, in which an upgrade to NATS 2.10.9 has resulted in a loss of the complete history of messages on two R3 streams. Initial investigations suggest this is due to the server nodes being killed during recovery by a failing health check, but we've yet to determine a minimal reproduction (minus Nomad and CSI plugins for persistent disk storage in GCP) which demonstrates the deletion of the message block files in this case.

Any thoughts on additional avenues to explore or whether this sounds like a plausible explanation would be greatly appreciated.

wallyqs commented 8 months ago

@mpwalkerdine were you using interest streams?

mpwalkerdine commented 8 months ago

Hi @wallyqs - no, permanent streams with durable consumers. A couple of potentially interesting things appeared in our logs around this time too: Resetting WAL state and Stream state too short (0 bytes) :thinking:

mpwalkerdine commented 8 months ago

A quick update for any future readers - we've narrowed the issue down somewhat and it appears to be related to the NATS disks being pulled at inappropriate times by the CSI plugin, leading to corruption and the following log line during the next startup:
Detected orphaned stream '[ACC] > [NAME]' will cleanup, which ultimately causes that node to delete its message storage on startup.

mpwalkerdine commented 8 months ago

@wallyqs FYI it looks like one or more of the fixes in 2.10.10 has resolved our issue - I can no longer reproduce the deletion on startup issue, no orphaned stream logs and no propagation of that empty state to other nodes in the cluster :crossed_fingers:

wallyqs commented 8 months ago

@mpwalkerdine thanks for the update!

mpwalkerdine commented 8 months ago

Looks like I spoke too soon - more testing shows that one of the nodes has lost its state on disk, and if it becomes a cluster leader that stream reports 0 messages. The files on disk are indeed empty:

-rw-r----- 1 root root  0 Feb  7 09:33 1.blk
-rw------- 1 root root 34 Feb  7 11:53 index.db

Information for Stream REDACTED created 2024-02-06 17:41:58

          Description: REDACTED
             Subjects: REDACTED.>
             Replicas: 3
              Storage: File

Options:

            Retention: Limits
     Acknowledgements: true
       Discard Policy: Old
     Duplicate Window: 2h0m0s
    Allows Msg Delete: true
         Allows Purge: true
       Allows Rollups: false

Limits:

     Maximum Messages: unlimited
  Maximum Per Subject: unlimited
        Maximum Bytes: 9.3 GiB
          Maximum Age: unlimited
 Maximum Message Size: unlimited
    Maximum Consumers: unlimited

Cluster Information:

                 Name: nats
               Leader: nats-node-2
              Replica: nats-node-0, current, seen 0.53s ago
              Replica: nats-node-1, current, seen 0.53s ago

State:

             Messages: 0
                Bytes: 0 B
             FirstSeq: 6,915
              LastSeq: 6,914
     Active Consumers: 362
yoadey commented 8 months ago

For me, the issue seems to be solved with the latest version, 2.10.11. We are not able to reproduce it anymore, so it could be closed.

Jarema commented 8 months ago

@mpwalkerdine Could you let us know if 2.10.11 fixed it for you too?

mpwalkerdine commented 8 months ago

@mpwalkerdine Could you let us know if 2.10.11 fixed it for you too?

I'll retest against 2.10.11 and report back. Our issue seems to be caused by some pretty unreasonable behaviour in our CSI disk management, so it is perhaps not unexpected for NATS to take issue with that. I did expect the faulty node to recover from the rest of the cluster though.

mpwalkerdine commented 8 months ago

@Jarema the behaviour has changed, although there still seems to be a discrepancy between nodes when they come back up:

Healthy node as leader:

State:

             Messages: 789
                Bytes: 5.0 MiB
             FirstSeq: 1 @ 2024-02-21T13:29:29 UTC
              LastSeq: 789 @ 2024-02-21T13:43:32 UTC
     Active Consumers: 321
   Number of Subjects: 95

Faulty node as leader:

State:

             Messages: 543
                Bytes: 2.4 MiB
             FirstSeq: 247 @ 2024-02-21T13:42:58 UTC
              LastSeq: 789 @ 2024-02-21T13:43:32 UTC
     Active Consumers: 321
   Number of Subjects: 3

So they're no longer losing all their state, just some of it.

Pertinent logs appear to be:

mpwalkerdine commented 7 months ago

Another quick update - we've been running 2.10.11 in production for a week or so now and, although it appears the state is no longer lost completely, it does still break the ability to publish messages when they're sent to a node with bad state. Our workaround for this is to scale down to a healthy node and back up again, and to recreate all the consumers (we've got separate idempotency protection).
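
For reference, a sketch of what recreating a durable consumer could look like (stream and consumer names are hypothetical; this is not necessarily the exact procedure used above):

package main

import (
    "log"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect(nats.DefaultURL)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Drop the existing durable (ignore the error if it is already gone)
    // and create it again so it binds to the current, healthy cluster state.
    _ = js.DeleteConsumer("ORDERS", "orders-worker") // hypothetical names
    _, err = js.AddConsumer("ORDERS", &nats.ConsumerConfig{
        Durable:       "orders-worker",
        AckPolicy:     nats.AckExplicitPolicy,
        DeliverPolicy: nats.DeliverAllPolicy,
    })
    if err != nil {
        log.Fatal(err)
    }
    log.Println("consumer recreated")
}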

wallyqs commented 1 month ago

Fixed via #5821, thanks a lot @yoadey for the reproduction at https://github.com/yoadey/nats-test that was very useful.