popaaaandrei opened this issue 5 years ago
Do you have a persistent volume for the replicas? Could you share more info about the deployment, for example which cloud it is running on?
Thank you for responding. The setup is GKE (v1.12.6-gke.7) + nats-operator + nats-streaming-operator, updated to the latest releases. I don't think there is a PV; I use the standard resources created through the operator, but at some point I will need to add persistence. This is still a dev environment, so we can play with various configs.
```yaml
---
apiVersion: "nats.io/v1alpha2"
kind: "NatsCluster"
metadata:
  name: "nats-cluster-1"
spec:
  size: 3
---
apiVersion: "streaming.nats.io/v1alpha1"
kind: "NatsStreamingCluster"
metadata:
  name: "nats-streaming-1"
spec:
  size: 3
  natsSvc: "nats-cluster-1"
```
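(On the persistence question above: NATS Streaming's file store is enabled with `--store file --dir <path>`. With the operator this appears to be exposed as `config.storeDir`; that field name is an assumption taken from the operator's examples, so verify it against the README of the nats-streaming-operator release you run. The data is also only durable if a persistent volume is actually mounted at that path, which the sketch below does not do by itself.)

```yaml
# Hedged sketch: the same streaming cluster, but asking for a file store.
# config.storeDir is assumed from the operator's examples; without a PV
# mounted at /nats/stan the store still lives on the node's local disk.
apiVersion: "streaming.nats.io/v1alpha1"
kind: "NatsStreamingCluster"
metadata:
  name: "nats-streaming-1"
spec:
  size: 3
  natsSvc: "nats-cluster-1"
  config:
    storeDir: "/nats/stan"
```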
Thanks for the info. On GKE do you have automatic node upgrades enabled? That drains the nodes and restarts all instances in a way that I think could affect the quorum of the cluster if only using local disk.
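A PodDisruptionBudget is the usual way to keep quorum through voluntary disruptions such as drains; a minimal sketch, assuming the operator labels the streaming pods with `app=nats-streaming` and `stan_cluster=nats-streaming-1` (check the labels on the pods it actually creates):

```yaml
# Sketch of a PodDisruptionBudget that keeps 2 of 3 nodes up during drains,
# so the Raft group retains quorum. The selector labels are an assumption.
apiVersion: policy/v1beta1   # policy/v1 on Kubernetes >= 1.21
kind: PodDisruptionBudget
metadata:
  name: nats-streaming-1-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nats-streaming
      stan_cluster: nats-streaming-1
```

Note a PDB only helps with voluntary disruptions (drains, upgrades), not node crashes.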
I have Automatic node upgrades = Disabled and Automatic node repair = Enabled on that cluster, but I did upgrade Kubernetes manually 4 days ago. The thing is, after I upgraded the nodes I checked that all the pods were in a Running state, so this happened after that.
Not having an individual PV per pod makes the whole operator worthless.
Also, because the operator creates individual pods (with no anti-affinity, even) rather than a StatefulSet, you can't actually create PVCs/PVs that would reliably match the scheduling decisions Kubernetes makes.
IMHO people should probably just stop relying on the operator and use a proper StatefulSet straight up (see the sketch below).
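For comparison, a rough sketch of what a StatefulSet-based deployment with per-pod PVCs and anti-affinity could look like. The names, image tag, storage size, NATS URL, and peer list are illustrative assumptions, and the nats-streaming-server clustering flags should be checked against the version you actually run:

```yaml
# Sketch only: a StatefulSet-based NATS Streaming cluster with one PVC per pod
# and pod anti-affinity. Names, image tag, storage size, and the NATS URL are
# assumptions; verify the nats-streaming-server flags for your version.
apiVersion: v1
kind: Service
metadata:
  name: stan
spec:
  clusterIP: None        # headless service; only provides stable pod DNS names
  selector:
    app: stan
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stan
spec:
  serviceName: stan
  replicas: 3
  selector:
    matchLabels:
      app: stan
  template:
    metadata:
      labels:
        app: stan
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: stan
            topologyKey: kubernetes.io/hostname   # spread replicas across nodes
      containers:
      - name: stan
        image: nats-streaming:0.14.2              # pin to whatever release you run
        args:
        - "--cluster_id=stan"
        - "--nats_server=nats://nats-cluster-1:4222"   # service created by nats-operator (assumed name)
        - "--store=file"
        - "--dir=/data/store"
        - "--clustered"
        - "--cluster_node_id=$(POD_NAME)"              # stan-0, stan-1, stan-2
        - "--cluster_peers=stan-0,stan-1,stan-2"       # adjust per the clustering docs for your version
        - "--cluster_log_path=/data/log"
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```

With `volumeClaimTemplates` each replica gets its own PV, and the required anti-affinity keeps the three replicas on different nodes, which is exactly the scheduling/PV matching that operator-managed bare pods cannot guarantee.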
Wasn't resilience supposed to be the great benefit of deploying NATS clusters? I came into the office this morning and found ALL `nats-streaming-1-*` pods in CrashLoopBackOff with around 500 restarts; meanwhile all messages have obviously been lost. Even if I delete all the pods it still doesn't recover. I have to delete the whole `natsstreamingcluster.streaming.nats.io/nats-streaming-1` and recreate it to make it work.