nats-io / k8s

NATS on Kubernetes with Helm Charts

Prom-Exporter container failing : [ERR] Could not find server_id: invalid character 'C' looking for beginning of value #785

Closed ksingh-scogo closed 8 months ago

ksingh-scogo commented 1 year ago

After deploying a NATS cluster using Helm, the prom-exporter container inside the nats-jetstream pods is failing with the error [ERR] Could not find server_id: invalid character 'C' looking for beginning of value

This is causing the health check to fail with the error Warning Unhealthy 75s (x16 over 3m45s) kubelet Startup probe failed: HTTP probe failed with statuscode: 400

NAME                              TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                               AGE
service/nats-jetstream            LoadBalancer   10.0.111.248   20.219.34.176   4222:31806/TCP,8080:30823/TCP         13m
service/nats-jetstream-headless   ClusterIP      None           <none>          4222/TCP,8080/TCP,6222/TCP,8222/TCP   13m

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nats-jetstream-box   1/1     1            1           13m

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/nats-jetstream-box-56586bdf9f   1         1         1       13m

NAME                              READY   AGE
statefulset.apps/nats-jetstream   0/3     13m


- `values.yaml`

################################################################################
# Global options
################################################################################
global:
  image:
    # global image pull policy to use for all container images in the chart
    # can be overridden by individual image pullPolicy
    pullPolicy: IfNotPresent
    # global registry to use for all container images in the chart
    # can be overridden by individual image registry
    registry:

  # global labels will be applied to all resources deployed by the chart
  labels:
    app.kubernetes.io/name: nats
    app.kubernetes.io/version: 2.9.21
    app.kubernetes.io/managed-by: Helm

################################################################################
# Common options
################################################################################

# override name of the chart
nameOverride:
# override full name of the chart+release
fullnameOverride: nats-jetstream
# override the namespace that resources are installed into
namespaceOverride:

# reference a common CA Certificate or Bundle in all nats config tls blocks and nats-box contexts
# note: tls.verify still must be set in the appropriate nats config tls blocks to require mTLS
tlsCA:
  enabled: false
  # set configMapName in order to mount an existing configMap to dir
  configMapName: nats-ca
  # set secretName in order to mount an existing secret to dir
  secretName:
  # directory to mount the configMap or secret to
  dir: /etc/nats-ca-cert
  # key in the configMap or secret that contains the CA Certificate or Bundle
  key: ca.crt

################################################################################
# NATS Stateful Set and associated resources
################################################################################

############################################################
# NATS config
############################################################
config:
  cluster:
    enabled: true
    port: 6222
    # must be 2 or higher when jetstream is enabled
    replicas: 3

    # apply to generated route URLs that connect to other pods in the StatefulSet
    routeURLs:
      # if both user and password are set, they will be added to route URLs
      # and the cluster authorization block
      user:
      password:
      # set to true to use FQDN in route URLs
      useFQDN: true
      k8sClusterDomain: cluster.local

    tls:
      enabled: false
      # set secretName in order to mount an existing secret to dir
      secretName: nats-cluster-tls
      dir: /etc/nats-certs/cluster
      cert: tls.crt
      key: tls.key
      # merge or patch the tls config
      # https://docs.nats.io/running-a-nats-service/configuration/securing_nats/tls
      merge: {}
      patch: []

    # merge or patch the cluster config
    # https://docs.nats.io/running-a-nats-service/configuration/clustering/cluster_config
    merge: {}
    patch: []

  jetstream:
    enabled: true

    fileStore:
      enabled: true
      dir: /data

      ############################################################
      # stateful set -> volume claim templates -> jetstream pvc
      ############################################################
      pvc:
        enabled: true
        size: 1Gi # use 10Gi for production
        storageClassName: managed-premium

        # merge or patch the jetstream pvc
        # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#persistentvolumeclaim-v1-core
        merge: {}
        patch: []
        # defaults to "{{ include "nats.fullname" $ }}-js"
        name:

      # defaults to the PVC size
      maxSize:

    memoryStore:
      enabled: false
      # ensure that container has a sufficient memory limit greater than maxSize
      maxSize: 1Gi

    # merge or patch the jetstream config
    # https://docs.nats.io/running-a-nats-service/configuration#jetstream
    merge: {}
    patch: []

  nats:
    port: 4222
    tls:
      enabled: true
      # set secretName in order to mount an existing secret to dir
      secretName: nats-client-tls
      dir: /etc/nats-certs/nats
      cert: tls.crt
      key: tls.key
      # merge or patch the tls config
      # https://docs.nats.io/running-a-nats-service/configuration/securing_nats/tls
      merge: {}
      patch: []

  websocket:
    enabled: true
    port: 8080
    tls:
      enabled: false
      # set secretName in order to mount an existing secret to dir
      secretName: nats-client-tls
      dir: /etc/nats-certs/websocket
      cert: tls.crt
      key: tls.key
      # merge or patch the tls config
      # https://docs.nats.io/running-a-nats-service/configuration/securing_nats/tls
      merge: {}
      patch: []

  monitor:
    enabled: true
    port: 8222
    tls:
      # config.nats.tls must be enabled also
      # when enabled, monitoring port will use HTTPS with the options from config.nats.tls
      enabled: true

  resolver:
    enabled: false
    dir: /data/resolver

    ############################################################
    # stateful set -> volume claim templates -> resolver pvc
    ############################################################
    pvc:
      enabled: true
      size: 1Gi
      storageClassName:

      # merge or patch the pvc
      # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#persistentvolumeclaim-v1-core
      merge: {}
      patch: []
      # defaults to "{{ include "nats.fullname" $ }}-resolver"
      name:

############################################################
# stateful set -> pod template -> nats container
############################################################
container:
  image:
    repository: nats
    tag: 2.9.21-alpine
    pullPolicy: IfNotPresent

  merge:
    # recommended limit is at least 2 CPU cores and 8Gi Memory for production JetStream clusters
    resources:
      requests:
        cpu: "250m"
        memory: 64Mi
      limits:
        cpu: "500m"
        memory: 128Mi

  # container port options
  # must be enabled in the config section also
  # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#containerport-v1-core
  ports:
    nats: {}
    leafnodes: {}
    websocket: {}
    mqtt: {}
    cluster: {}
    gateway: {}
    monitor: {}
    profiling: {}

  # map with key as env var name, value can be string or map
  # example:
  #
  #   env:
  #     GOMEMLIMIT: 7GiB
  #     TOKEN:
  #       valueFrom:
  #         secretKeyRef:
  #           name: nats-auth
  #           key: token
  env: {}

  # merge or patch the container
  # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core
  merge: {}
  patch: []

############################################################
# stateful set -> pod template -> reloader container
############################################################
reloader:
  enabled: true
  image:
    repository: natsio/nats-server-config-reloader
    tag: 0.11.0
    pullPolicy: IfNotPresent

  merge:
    # recommended limit is at least 2 CPU cores and 8Gi Memory for production JetStream clusters
    resources:
      requests:
        cpu: "50m"
        memory: 64Mi
      limits:
        cpu: "50m"
        memory: 64Mi

  # env var map, see nats.env for an example
  env: {}

  # all nats container volume mounts with the following prefixes
  # will be mounted into the reloader container
  natsVolumeMountPrefixes:

############################################################
# stateful set -> pod template -> prom-exporter container
############################################################
# config.monitor must be enabled
promExporter:
  enabled: true
  image:
    repository: natsio/prometheus-nats-exporter
    tag: 0.12.0
    pullPolicy:
    registry:

  port: 7777

  # env var map, see nats.env for an example
  env: {}

  # merge or patch the container
  # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core
  merge: {}
  patch: []

  ############################################################
  # prometheus pod monitor
  ############################################################
  podMonitor:
    enabled: true

    # merge or patch the pod monitor
    # https://prometheus-operator.dev/docs/operator/api/#monitoring.coreos.com/v1.PodMonitor
    merge: {}
    patch: []
    # defaults to "{{ include "nats.fullname" $ }}"
    name:

############################################################
# service
############################################################
# Bug: currently the NATS Helm chart does not allow adding annotations at the service level
# See: https://github.com/nats-io/k8s/issues/784
# Workaround: add the annotation manually
#   kubectl annotate service -n nats nats-jetstream service.beta.kubernetes.io/azure-load-balancer-resource-group=AzureResourceGroup
service:
  enabled: true
  merge:
    spec:
      type: LoadBalancer
      loadBalancerIP: "x.x.x.x"
  patch: []
  # defaults to "{{ include "nats.fullname" $ }}"
  name:

############################################################
# other nats extension points
############################################################

# stateful set
statefulSet:
  # merge or patch the stateful set
  # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#statefulset-v1-apps
  merge: {}
  patch: []
  # defaults to "{{ include "nats.fullname" $ }}"
  name:

# stateful set -> pod template
podTemplate:
  # adds a hash of the ConfigMap as a pod annotation
  # this will cause the StatefulSet to roll when the ConfigMap is updated
  configChecksumAnnotation: true

  # map of topologyKey: topologySpreadConstraint
  # labelSelector will be added to match StatefulSet pods
  #
  # topologySpreadConstraints:
  #   kubernetes.io/hostname:
  #     maxSkew: 1
  #     whenUnsatisfiable: DoNotSchedule
  #
  topologySpreadConstraints: {}

  # merge or patch the pod template
  # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#pod-v1-core
  merge: {}
  patch: []

# headless service
headlessService:
  enabled: false
  # merge or patch the headless service
  # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#service-v1-core
  merge: {}
  patch: []
  # defaults to "{{ include "nats.fullname" $ }}-headless"
  name:

# config map
configMap:
  # merge or patch the config map
  # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#configmap-v1-core
  merge: {}
  patch: []
  # defaults to "{{ include "nats.fullname" $ }}-config"
  name:

# pod disruption budget
podDisruptionBudget:
  enabled: true
  # merge or patch the pod disruption budget
  # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#poddisruptionbudget-v1-policy
  merge: {}
  patch: []
  # defaults to "{{ include "nats.fullname" $ }}"
  name:

# service account
serviceAccount:
  enabled: false
  # merge or patch the service account
  # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#serviceaccount-v1-core
  merge: {}
  patch: []
  # defaults to "{{ include "nats.fullname" $ }}"
  name:

############################################################
# natsBox
#
# NATS Box Deployment and associated resources
############################################################
natsBox:
  enabled: true

  ############################################################
  # NATS contexts
  ############################################################
  contexts:
    default:
      creds:
        # set contents in order to create a secret with the creds file contents
        contents:
        # set secretName in order to mount an existing secret to dir
        secretName:
        # defaults to /etc/nats-creds/<context-name>
        dir:
        key: nats.creds
      nkey:
        # set contents in order to create a secret with the nkey file contents
        contents:
        # set secretName in order to mount an existing secret to dir
        secretName:
        # defaults to /etc/nats-nkeys/<context-name>
        dir:
        key: nats.nk
      # used to connect with client certificates
      tls:
        # set secretName in order to mount an existing secret to dir
        secretName:
        # defaults to /etc/nats-certs/<context-name>
        dir:
        cert: tls.crt
        key: tls.key

      # merge or patch the context
      # https://docs.nats.io/using-nats/nats-tools/nats_cli#nats-contexts
      merge: {}
      patch: []

  # name of context to select by default
  defaultContextName: default

  ############################################################
  # deployment -> pod template -> nats-box container
  ############################################################
  container:
    image:
      repository: natsio/nats-box
      tag: 0.13.8
      pullPolicy: IfNotPresent
      registry:

    # env var map, see nats.env for an example
    env: {}

    # merge or patch the container
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#container-v1-core
    merge: {}
    patch: []

  ############################################################
  # other nats-box extension points
  ############################################################

  # deployment
  deployment:
    # merge or patch the deployment
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#deployment-v1-apps
    merge: {}
    patch: []
    # defaults to "{{ include "nats.fullname" $ }}-box"
    name:

  # deployment -> pod template
  podTemplate:
    # merge or patch the pod template
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#pod-v1-core
    merge: {}
    patch: []

  # contexts secret
  contextsSecret:
    # merge or patch the context secret
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#secret-v1-core
    merge: {}
    patch: []
    # defaults to "{{ include "nats.fullname" $ }}-box-contexts"
    name:

  # contents secret
  contentsSecret:
    # merge or patch the contents secret
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#secret-v1-core
    merge: {}
    patch: []
    # defaults to "{{ include "nats.fullname" $ }}-box-contents"
    name:

  # service account
  serviceAccount:
    enabled: false
    # merge or patch the service account
    # https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.24/#serviceaccount-v1-core
    merge: {}
    patch: []
    # defaults to "{{ include "nats.fullname" $ }}-box"
    name:

################################################################################
# Extra user-defined resources
################################################################################
#
# add arbitrary user-generated resources
# example:
#
# config:
#   websocket:
#     enabled: true
#
# extraResources:
# - apiVersion: networking.istio.io/v1beta1
#   kind: VirtualService
#   metadata:
#     name:
#       $tplYaml: >
#         {{ include "nats.fullname" $ | quote }}
#     labels:
#       $tplYaml: |
#         {{ include "nats.labels" $ }}
#   spec:
#     hosts:
#     - demo.nats.io
#     gateways:
#     - my-gateway
#     http:
#     - name: default
#       match:
#       - name: root
#         uri:
#           exact: /
#       route:
#       - destination:
#           host:
#             $tplYaml: >
#               {{ .Values.service.name | quote }}
#           port:
#             number:
#               $tplYaml: >
#                 {{ .Values.config.websocket.port }}
#
# extraResources: []


- `kubectl describe pod/nats-jetstream-0`

Name:             nats-jetstream-0
Namespace:        nats
Priority:         0
Service Account:  default
Node:             aks-ondemand-20722617-vmss000002/10.224.0.6
Start Time:       Sun, 20 Aug 2023 23:42:54 +0530
Labels:           app.kubernetes.io/component=nats
                  app.kubernetes.io/instance=nats
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=nats
                  app.kubernetes.io/version=2.9.21
                  controller-revision-hash=nats-jetstream-8b6bb9b85
                  environment=staging
                  helm.sh/chart=nats-1.0.2
                  statefulset.kubernetes.io/pod-name=nats-jetstream-0
Annotations:      checksum/config: b40d09e5645850ef7937f414d605f9f91199172781c2f89e698bffcef15ff9ee
Status:           Running
IP:               10.244.1.35
IPs:
  IP:  10.244.1.35
Controlled By:  StatefulSet/nats-jetstream
Containers:
  nats:
    Container ID:   containerd://44dc3d9a95f761b42cc6809604f216f1b79b4cb588a05f5c53678a120773dbeb
    Image:          nats:2.9.21-alpine
    Image ID:       docker.io/library/nats@sha256:511f5c4cfc6fdd61eb66afab99dfb38bed69aae630d8d5b36bc9bfc716723cd8
    Ports:          4222/TCP, 8080/TCP, 6222/TCP, 8222/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP
    Args:           --config /etc/nats-config/nats.conf
    State:          Running
      Started:      Sun, 20 Aug 2023 23:43:07 +0530
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  128Mi
    Requests:
      cpu:     250m
      memory:  64Mi
    Liveness:   http-get http://:monitor/healthz%3Fjs-enabled-only=true delay=10s timeout=5s period=30s #success=1 #failure=3
    Readiness:  http-get http://:monitor/healthz%3Fjs-server-only=true delay=10s timeout=5s period=10s #success=1 #failure=3
    Startup:    http-get http://:monitor/healthz delay=10s timeout=5s period=10s #success=1 #failure=90
    Environment:
      POD_NAME:     nats-jetstream-0 (v1:metadata.name)
      SERVER_NAME:  $(POD_NAME)
    Mounts:
      /data from nats-jetstream-js (rw)
      /etc/nats-certs/nats from nats-tls (rw)
      /etc/nats-config from config (rw)
      /var/run/nats from pid (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bvdpc (ro)
  reloader:
    Container ID:   containerd://834f4510bcdec3f46e532fd88e9a93e1d746300004eccc523c46dff844278a19
    Image:          natsio/nats-server-config-reloader:0.11.0
    Image ID:       docker.io/natsio/nats-server-config-reloader@sha256:c3a755eab2cc4702878d8d7bb75b82cd692a2557315cd18a0fac84f77f9253c9
    Port:           <none>
    Host Port:      <none>
    Args:           -pid /var/run/nats/nats.pid -config /etc/nats-config/nats.conf -config /etc/nats-certs/nats/tls.crt -config /etc/nats-certs/nats/tls.key
    State:          Running
      Started:      Sun, 20 Aug 2023 23:43:08 +0530
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  64Mi
    Requests:
      cpu:     50m
      memory:  64Mi
    Environment:  <none>
    Mounts:
      /etc/nats-certs/nats from nats-tls (rw)
      /etc/nats-config from config (rw)
      /var/run/nats from pid (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bvdpc (ro)
  prom-exporter:
    Container ID:   containerd://23ada762bd737e752e17a00c9b223468dce58de93e066e4e11c2ab388ba169a8
    Image:          natsio/prometheus-nats-exporter:0.12.0
    Image ID:       docker.io/natsio/prometheus-nats-exporter@sha256:74e768968abb7883f6c89639a4d7d8f59054c61297c1f4c4b633cfeb6c8127dc
    Port:           7777/TCP
    Host Port:      0/TCP
    Args:           -port=7777 -connz -routez -subz -varz -prefix=nats -use_internal_server_id -jsz=all http://localhost:8222/
    State:          Running
      Started:      Sun, 20 Aug 2023 23:43:08 +0530
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bvdpc (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nats-jetstream-js:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  nats-jetstream-js-nats-jetstream-0
    ReadOnly:   false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nats-jetstream-config
    Optional:  false
  pid:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  nats-tls:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nats-client-tls
    Optional:    false
  kube-api-access-bvdpc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints: kubernetes.io/hostname:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/component=nats,app.kubernetes.io/instance=nats,app.kubernetes.io/name=nats
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               4m7s                  default-scheduler        Successfully assigned nats/nats-jetstream-0 to aks-ondemand-20722617-vmss000002
  Normal   SuccessfulAttachVolume  3m56s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-2971d4c6-ab0c-4cbc-9f1e-332227b1de14"
  Normal   Pulled                  3m55s                 kubelet                  Container image "nats:2.9.21-alpine" already present on machine
  Normal   Created                 3m55s                 kubelet                  Created container nats
  Normal   Started                 3m54s                 kubelet                  Started container nats
  Normal   Pulled                  3m54s                 kubelet                  Container image "natsio/nats-server-config-reloader:0.11.0" already present on machine
  Normal   Created                 3m54s                 kubelet                  Created container reloader
  Normal   Started                 3m54s                 kubelet                  Started container reloader
  Normal   Pulled                  3m54s                 kubelet                  Container image "natsio/prometheus-nats-exporter:0.12.0" already present on machine
  Normal   Created                 3m54s                 kubelet                  Created container prom-exporter
  Normal   Started                 3m54s                 kubelet                  Started container prom-exporter
  Warning  Unhealthy               75s (x16 over 3m45s)  kubelet                  Startup probe failed: HTTP probe failed with statuscode: 400

[33] 2023/08/20 18:13:48.230733 [ERR] Could not find server_id: invalid character 'C' looking for beginning of value
[33] 2023/08/20 18:14:18.231273 [ERR] Could not find server_id: invalid character 'C' looking for beginning of value

ksingh-scogo commented 1 year ago
(screenshot attached)
caleblloyd commented 1 year ago

What are the values you are using? I don't need the entire values file from the chart, only the ones that you have changed.

ksingh-scogo commented 1 year ago

@caleblloyd pls excuse the chattiness

Found the solution for the health-check failure: manually editing the StatefulSet to change the probe scheme from HTTP to HTTPS for the livenessProbe, readinessProbe, and startupProbe.

This is because the monitoring port serves HTTPS once config.monitor.tls is enabled, while the generated probes still use plain HTTP.

So when monitor.tls is requested in values.yaml, the Helm template should update the StatefulSet health-check scheme to HTTPS.
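
For reference, that manual edit can be expressed as a one-off kubectl patch. This is only a sketch: it assumes the StatefulSet is named nats-jetstream in the nats namespace and that nats is the first container in the pod template (as in the describe output above), and a helm upgrade will regenerate the StatefulSet and undo it:

kubectl patch statefulset nats-jetstream -n nats --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/startupProbe/httpGet/scheme", "value": "HTTPS"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/httpGet/scheme", "value": "HTTPS"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/httpGet/scheme", "value": "HTTPS"}
]'

The same ops should also be expressible persistently through the chart's podTemplate.patch value (with paths relative to the pod spec, e.g. /spec/containers/0/...), but the real fix belongs in the chart template.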

ksingh-scogo commented 1 year ago

Still getting [ERR] Could not find server_id: invalid character 'C' looking for beginning of value in the prom-exporter container

[35] 2023/08/21 14:58:17.670767 [ERR] Could not find server_id: invalid character 'C' looking for beginning of value
[35] 2023/08/21 14:58:47.670179 [ERR] Could not find server_id: invalid character 'C' looking for beginning of value
[35] 2023/08/21 14:59:17.670885 [ERR] Could not find server_id: invalid character 'C' looking for beginning of value
[35] 2023/08/21 14:59:47.670274 [ERR] Could not find server_id: invalid character 'C' looking for beginning of value
[35] 2023/08/21 15:00:17.670546 [ERR] Could not find server_id: invalid character 'C' looking for beginning of value
[35] 2023/08/21 15:00:47.670372 [ERR] Could not find server_id: invalid character 'C' looking for beginning of value
[35] 2023/08/21 15:01:17.670191 [ERR] Could not find server_id: invalid character 'C' looking for beginning of value

All health checks are passing, for the record

(screenshot attached)
ksingh-scogo commented 1 year ago

@caleblloyd your pointers on this would be of great help

jjsimps commented 9 months ago

I think this is happening because TLS is enabled for the monitor API. When I generate the helm-templated version, it has:

      - args:
        - -port=7777
        - -connz
        - -routez
        - -subz
        - -varz
        - -prefix=nats
        - -use_internal_server_id
        - -jsz=all
        - http://localhost:8222/
        image: natsio/prometheus-nats-exporter:0.13.0
        name: prom-exporter
        ports:
        - containerPort: 7777
          name: prom-metrics

Note that it passes http://localhost:8222/ instead of https://localhost:8222/.
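
A quick way to confirm which scheme the chart renders for the exporter, assuming the chart repo is added under the alias nats and the release is named nats (as in this deployment):

# render the chart locally with the same values and inspect the URL handed to the exporter
helm template nats nats/nats -f values.yaml | grep 'localhost:8222'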

So the solution would be to have the chart generate the https url instead.

The other issue is that there doesn't seem to be a way to specify the tlscacert/tlscert/tlskey info, or the dns name to use.

So the following solutions may work:

  1. Disable TLS on the monitoring port (see the values snippet below)
  2. Fix the chart to support TLS properly for the prom-exporter container
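
For option 1, the change is a single switch in the values shown earlier in this issue:

config:
  monitor:
    enabled: true
    port: 8222
    tls:
      # serve the monitoring endpoint over plain HTTP again so that the generated
      # probes and the exporter's http://localhost:8222/ URL work unchanged
      enabled: false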

To fix the chart:

  1. When monitor TLS is enabled, generate an appropriate URL for the exporter (https://<some_configurable_hostname>:<monitor_port>/ instead of http://localhost:8222/)
  2. Allow specifying TLS params (the tlscacert/tlscert/tlskey material and the DNS name to verify against); a possible interim workaround is sketched below
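
Until the chart handles this natively, one possible stop-gap is to rewrite the exporter's monitoring URL through the chart's existing promExporter.patch extension point (JSON-patch ops applied to the prom-exporter container). This is only a sketch: the /args/8 index assumes the URL is the ninth argument, as in the rendered manifest above, so verify it with helm template before applying; certificate verification will still need the exporter's TLS flags (e.g. -tlscacert) plus a mounted cert, which is exactly the gap described in point 2.

promExporter:
  enabled: true
  patch:
  # point the exporter at the HTTPS monitoring endpoint; the arg index must match
  # the position of the URL in the rendered container args
  - op: replace
    path: /args/8
    value: https://localhost:8222/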