popaaaandrei opened this issue 5 years ago
Do you have a persistent volume for the replicas? Could you share more info about the deployment, for example which cloud it is running on?
Thank you for responding. The setup is GKE (v1.12.6-gke.7) + nats-operator + nats-streaming-operator, updated to the latest releases. I don't think there is a PV; I use the standard resources created through the operator, but at some point I will need to add persistence. This is still a dev environment, so we can play with various configs.
```yaml
---
apiVersion: "nats.io/v1alpha2"
kind: "NatsCluster"
metadata:
  name: "nats-cluster-1"
spec:
  size: 3
---
apiVersion: "streaming.nats.io/v1alpha1"
kind: "NatsStreamingCluster"
metadata:
  name: "nats-streaming-1"
spec:
  size: 3
  natsSvc: "nats-cluster-1"
```
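(On the persistence question above: NATS Streaming's file store is enabled with `--store file --dir <path>`. With the operator this appears to be exposed as `config.storeDir`; that field name is an assumption taken from the operator's examples, so verify it against the README of the nats-streaming-operator release you run. The data is also only durable if a persistent volume is actually mounted at that path, which the sketch below does not do by itself.)

```yaml
# Hedged sketch: the same streaming cluster, but asking for a file store.
# config.storeDir is assumed from the operator's examples; without a PV
# mounted at /nats/stan the store still lives on the node's local disk.
apiVersion: "streaming.nats.io/v1alpha1"
kind: "NatsStreamingCluster"
metadata:
  name: "nats-streaming-1"
spec:
  size: 3
  natsSvc: "nats-cluster-1"
  config:
    storeDir: "/nats/stan"
```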
Thanks for the info. On GKE do you have automatic node upgrades enabled? That drains the nodes and restarts all instances in a way that I think could affect the quorum of the cluster if only using local disk.
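A PodDisruptionBudget is the usual way to keep quorum through voluntary disruptions such as drains; a minimal sketch, assuming the operator labels the streaming pods with `app=nats-streaming` and `stan_cluster=nats-streaming-1` (check the labels on the pods it actually creates):

```yaml
# Sketch of a PodDisruptionBudget that keeps 2 of 3 nodes up during drains,
# so the Raft group retains quorum. The selector labels are an assumption.
apiVersion: policy/v1beta1   # policy/v1 on Kubernetes >= 1.21
kind: PodDisruptionBudget
metadata:
  name: nats-streaming-1-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nats-streaming
      stan_cluster: nats-streaming-1
```

Note a PDB only helps with voluntary disruptions (drains, upgrades), not node crashes.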
I have Automatic node upgrades = Disabled and Automatic node repair = Enabled on that cluster, but I did upgrade Kubernetes manually 4 days ago. The thing is, after I upgraded the nodes I checked that all the pods were in a Running state, so this happened after that.
Not having an individual PV per pod makes the whole operator worthless.
Also, because the operator creates individual pods (with no anti-affinity, even) rather than a StatefulSet, you can't actually create PVCs/PVs that would reliably match the scheduling decisions Kubernetes makes.
IMHO people should probably just stop relying on the operator and use a proper StatefulSet straight up (see the sketch below).
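For comparison, a rough sketch of what a StatefulSet-based deployment with per-pod PVCs and anti-affinity could look like. The names, image tag, storage size, NATS URL, and peer list are illustrative assumptions, and the nats-streaming-server clustering flags should be checked against the version you actually run:

```yaml
# Sketch only: a StatefulSet-based NATS Streaming cluster with one PVC per pod
# and pod anti-affinity. Names, image tag, storage size, and the NATS URL are
# assumptions; verify the nats-streaming-server flags for your version.
apiVersion: v1
kind: Service
metadata:
  name: stan
spec:
  clusterIP: None        # headless service; only provides stable pod DNS names
  selector:
    app: stan
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stan
spec:
  serviceName: stan
  replicas: 3
  selector:
    matchLabels:
      app: stan
  template:
    metadata:
      labels:
        app: stan
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: stan
            topologyKey: kubernetes.io/hostname   # spread replicas across nodes
      containers:
      - name: stan
        image: nats-streaming:0.14.2              # pin to whatever release you run
        args:
        - "--cluster_id=stan"
        - "--nats_server=nats://nats-cluster-1:4222"   # service created by nats-operator (assumed name)
        - "--store=file"
        - "--dir=/data/store"
        - "--clustered"
        - "--cluster_node_id=$(POD_NAME)"              # stan-0, stan-1, stan-2
        - "--cluster_peers=stan-0,stan-1,stan-2"       # adjust per the clustering docs for your version
        - "--cluster_log_path=/data/log"
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```

With `volumeClaimTemplates` each replica gets its own PV, and the required anti-affinity keeps the three replicas on different nodes, which is exactly the scheduling/PV matching that operator-managed bare pods cannot guarantee.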
Wasn't resilience supposed to be the great benefit of deploying NATS clusters? I came into the office this morning and found ALL `nats-streaming-1-*` pods in CrashLoopBackOff with around 500 restarts; meanwhile all messages have obviously been lost. Even if I delete all the pods it still doesn't recover. I have to delete the whole `natsstreamingcluster.streaming.nats.io/nats-streaming-1` and recreate it to make it work.