nats-io / k8s

NATS on Kubernetes with Helm Charts
Apache License 2.0

NATS Container restart frequently in AKS Cluster with the following error logs #865

Closed saitessell closed 6 months ago

saitessell commented 7 months ago

What version were you using?

Helm chart 1.0.2

My NATS Helm chart values file is

config:
  nats:
    tls:
      enabled: true
      secretName: ${NATS_CERT_SECRET_NAME}
      cert: "tls.crt"
      key: "tls.key"
    resources:
      limits:
        cpu: 256m
        memory: 256Mi
      requests:
        cpu: 100m
        memory: 128Mi
  jetstream:
    enabled: true
    memoryStore:
      enabled: true
      # ensure that container has a sufficient memory limit greater than maxSize
      maxSize: 5Gi

    fileStore:
      pvc:
        enabled: true
        size: 5Gi
        storageClassName: aks-storage-class # NOTE: Azure setup but customize as needed for your infra.

  cluster:
    enabled: true
    replicas: 3
    name: nats-cluster
    noAdvertise: true
  resolver:
    enabled: true
    merge:
      type: full
      interval: 2m
      timeout: 1.9s
      allow_delete: true

With this configuration I am unable to get NATS running, and I am seeing the following logs in each of the NATS pods:


[1] 2024/02/12 05:56:42.769637 [ERR] Error trying to connect to route (attempt 1): lookup for host "nats-service-0.nats-service-headless": lookup nats-service-0.nats-service-headless on 172.29.255.254:53: no such host
[1] 2024/02/12 05:56:43.325412 [INF] JetStream cluster new metadata leader: nats-service-1/nats-service
[1] 2024/02/12 05:56:54.992207 [INF] 172.27.6.37:41188 - rid:8 - Route connection created
[1] 2024/02/12 05:56:54.992603 [INF] 172.27.6.37:41188 - rid:8 - Router connection closed: Duplicate Route
[1] 2024/02/12 05:57:00.934672 [INF] 172.27.5.82:41014 - rid:9 - Route connection created
[1] 2024/02/12 05:57:00.935144 [INF] 172.27.5.82:41014 - rid:9 - Router connection closed: Duplicate Route
[1] 2024/02/12 05:58:07.146084 [INF] JetStream cluster no metadata leader
[1] 2024/02/12 05:58:29.664137 [INF] JetStream cluster no metadata leader
[1] 2024/02/12 05:58:42.596377 [WRN] JetStream has not established contact with a meta leader
[1] 2024/02/12 05:58:50.704588 [INF] JetStream cluster no metadata leader
[1] 2024/02/12 05:59:13.759596 [INF] JetStream cluster no metadata leader
[1] 2024/02/12 05:59:39.226129 [INF] JetStream cluster no metadata leader
[1] 2024/02/12 06:00:00.277063 [INF] JetStream cluster no metadata leader

### What environment was the server running in?

NATS is deployed in an AKS cluster with kube-dns.

### Is this defect reproducible?

Yes. Deploying the Helm chart with JetStream enabled and in cluster mode causes the nats containers in the nats-service pods to fail their health check probes.

### Given the capability you are leveraging, describe your expectation?

I want to run NATS in cluster mode with JetStream enabled.

### Given the expectation, what is the defect you are observing?

Because of the container failures, I am not able to bring up NATS.

caleblloyd commented 7 months ago

This looks like the configuration for the 0.x Helm chart. Can you upgrade to the latest 1.x Helm chart?

https://github.com/nats-io/k8s/blob/main/helm/charts/nats/UPGRADING.md

saitessell commented 7 months ago

This is actually the config for the 1.x version of the Helm chart. I took the reference from here: https://github.com/nats-io/k8s/blob/nats-1.0.2/helm/charts/nats/values.yaml

caleblloyd commented 7 months ago

Ah ok, I must have misread it then. The resources go under container.merge, not config.nats. Also, if you are going to give it 5Gi in config.jetstream.memoryStore.maxSize, you will want to make sure to request more than that amount of memory:

https://github.com/nats-io/k8s/blob/nats-1.0.2/helm/charts/nats/README.md#nats-container-resources

container:
  env:
    # different from k8s units, suffix must be B, KiB, MiB, GiB, or TiB
    # should be ~90% of memory limit
    GOMEMLIMIT: 7GiB
  merge:
    # recommended limit is at least 2 CPU cores and 8Gi Memory for production JetStream clusters
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        cpu: "2"
        memory: 8Gi
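
The "~90% of memory limit" guideline above can be made concrete. Here is a small illustrative Python helper (not part of the chart, name is hypothetical) that converts a Kubernetes memory quantity like 8Gi into a Go-runtime GOMEMLIMIT string like 7GiB:

```python
# Hypothetical helper: derive a GOMEMLIMIT value at ~90% of a
# Kubernetes memory limit. k8s quantities use Ki/Mi/Gi/Ti suffixes,
# while Go's GOMEMLIMIT expects B, KiB, MiB, GiB, or TiB.

K8S_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def gomemlimit_from_k8s(limit: str, fraction: float = 0.9) -> str:
    """Convert e.g. '8Gi' to a Go memory-limit string like '7GiB'."""
    for suffix, factor in K8S_UNITS.items():
        if limit.endswith(suffix):
            total_bytes = int(limit[: -len(suffix)]) * factor
            break
    else:
        raise ValueError(f"unsupported quantity: {limit}")
    target = int(total_bytes * fraction)
    # Round down to a whole GiB when possible, otherwise whole MiB.
    if target >= 2**30:
        return f"{target // 2**30}GiB"
    return f"{target // 2**20}MiB"
```

With an 8Gi container limit this yields 7GiB, matching the value in the snippet above.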

From the looks of it, your containers are not able to establish network connectivity to one another. For example, it looks like you named your deployment nats-service, so from the nats-service-0 pod you should be able to resolve and connect to nats-service-1.nats-service-headless and nats-service-2.nats-service-headless.
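
One way to check this from inside the cluster is to exec into a pod and try the peer's headless-service name directly (a sketch; it assumes a release named nats-service in the current namespace and that the container image ships busybox tools like nslookup and nc, so adjust for your setup):

```shell
# Check that the peer's headless-service DNS name resolves
kubectl exec -it nats-service-0 -c nats -- \
  nslookup nats-service-1.nats-service-headless

# Check that the cluster-route port (6222 by default) is reachable
kubectl exec -it nats-service-0 -c nats -- \
  nc -zv nats-service-1.nats-service-headless 6222
```

If the lookup fails, the problem is likely kube-dns/CoreDNS or headless-service configuration rather than NATS itself.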