nats-io / nats-operator

NATS Operator
https://nats-io.github.io/k8s/
Apache License 2.0

NATS cluster bound tokens randomly being deleted #310

Open akubala opened 3 years ago

akubala commented 3 years ago

Hello!

I am using the following setup on Kubernetes v1.18.9 (EKS):

- nats-operator
- nats-cluster
- nats-streaming

I have not been able to find any error or warning logs in nats-operator, nats-cluster, or nats-streaming.

The deletions appear to be random; to restore a working config, I have to re-apply all affected NatsServiceRoles created for my services. My services use the following NatsServiceRole config:

```yaml
apiVersion: nats.io/v1alpha2
kind: NatsServiceRole
metadata:
  annotations:
    helm.fluxcd.io/antecedent: my-ns:helmrelease/company
  labels:
    nats_cluster: my-nats-cluster
  name: company
  namespace: my-nats-io
spec:
  permissions:
    publish:
    - '>'
    subscribe:
    - '>'
```

Moreover, the secrets for my services are being deleted, but the nats-streaming one is not. Also, the configuration stored in the nats-cluster secret (nats.conf) is not touched when the bound tokens are deleted. Please let me know what additional information I should provide to describe the issue better.

Thanks!

hpdobrica commented 3 years ago

having the exact same issue with the same config :(

hpdobrica commented 3 years ago

Probably worth mentioning that the operator emits these logs when the problem occurs (usually just for one of the many secrets that disappeared):

```
E0630 13:58:30.324121 1 generic.go:108] error syncing "nats-io/nats-cluster": failed to update auth data in config secret: secrets "some-nats-cluster-bound-token" not found

E0630 13:58:50.245055 1 generic.go:108] error syncing "nats-io/nats-cluster": failed to update auth data in config secret: Operation cannot be fulfilled on secrets "some-nats-cluster-bound-token": StorageError: invalid object, Code: 4, Key: /registry/secrets/app/some-nats-cluster-bound-token, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 1cc9f3e7-6791-48f8-8097-419d49a5783b, UID in object meta:
```

Currently on EKS 1.19

gaja-hp commented 2 years ago

Hello @hpdobrica, did you find any solution for this issue? We are still stuck on it. Thanks.

hpdobrica commented 2 years ago

Hey @gaja-hp, we didn't exactly "find a solution", but we mitigated the issue by moving away from service account authentication towards using basic authentication.
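
For reference, here is a minimal sketch of what that migration can look like, assuming the `clientsAuthSecret` option described in the nats-operator README (all names, credentials, and the secret key are placeholders; double-check against the README before using):

```yaml
# Hypothetical example: switch the NatsCluster away from service-account auth
# to credentials stored in a plain Kubernetes secret.
apiVersion: nats.io/v1alpha2
kind: NatsCluster
metadata:
  name: my-nats-cluster
  namespace: my-nats-io
spec:
  size: 3
  auth:
    # enableServiceAccounts: true          # previously used for NatsServiceRole-based auth
    clientsAuthSecret: "nats-clients-auth" # secret holding a clients-auth.json file
    clientsAuthTimeout: 5
---
# The referenced secret; equivalent to
#   kubectl create secret generic nats-clients-auth --from-file=clients-auth.json
apiVersion: v1
kind: Secret
metadata:
  name: nats-clients-auth
  namespace: my-nats-io
stringData:
  clients-auth.json: |
    {
      "users": [
        {
          "username": "company",
          "password": "change-me",
          "permissions": {
            "publish": [">"],
            "subscribe": [">"]
          }
        }
      ]
    }
```

The trade-off is that credentials and permissions are now managed by hand in that one secret instead of per-service NatsServiceRoles, but there are no bound-token secrets left to disappear.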

However, I might have an idea why the issue is occurring:

The deletion process is driven by k8s ownerReferences: the idea is that once a NatsServiceRole is deleted, its secret is deleted as well, because the NatsServiceRole is the secret's owner.

I believe the problem exists because owner references are not meant to work across different namespaces (see the note in https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/#owner-references-in-object-specifications), but I'm not totally sure; I might be missing something.
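
To check whether this is what's happening, it may help to look at `metadata.ownerReferences` on one of the bound-token secrets before it disappears. The sketch below is only an illustration of what such a reference looks like, based on the ownership described above; the secret name, owner name, and UID are placeholders:

```yaml
# kubectl -n app get secret some-nats-cluster-bound-token -o yaml
# (trimmed to the relevant metadata; values are illustrative)
apiVersion: v1
kind: Secret
metadata:
  name: some-nats-cluster-bound-token
  namespace: app
  ownerReferences:
  - apiVersion: nats.io/v1alpha2
    kind: NatsServiceRole     # the owner described above
    name: some-service-role
    uid: <uid-of-the-owner>   # if this UID no longer matches a live object,
                              # the garbage collector deletes the secret
```

If the cross-namespace theory is right, the garbage collector should also report the invalid reference as an event with reason `OwnerRefInvalidNamespace` (mentioned on the docs page linked above), so checking `kubectl get events` in the affected namespaces could confirm or rule it out.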