redpanda-data / helm-charts

Redpanda Helm Chart

Deploying a fresh cluster via helm chart is stuck at `bootstrap-yaml-envsubst` init container #1564

Open mattfysh opened 2 days ago

mattfysh commented 2 days ago

What happened?

Title

What did you expect to happen?

Works

How can we reproduce it (as minimally and precisely as possible)? Please include values file.

```console $ helm get values -n --all affinity: {} auditLogging: clientMaxBufferSize: 16777216 enabled: false enabledEventTypes: null excludedPrincipals: null excludedTopics: null listener: internal partitions: 12 queueDrainIntervalMs: 500 queueMaxBufferSizePerShard: 1048576 replicationFactor: null auth: sasl: bootstrapUser: mechanism: SCRAM-SHA-256 enabled: true mechanism: SCRAM-SHA-512 secretRef: sasl-b5af2352 users: [] clusterDomain: cluster.local commonLabels: {} config: cluster: {} node: crash_loop_limit: 5 pandaproxy_client: {} rpk: {} schema_registry_client: {} tunable: compacted_log_segment_size: 67108864 kafka_connection_rate_limit: 1000 log_segment_size_max: 268435456 log_segment_size_min: 16777216 max_compacted_log_segment_size: 536870912 connectors: deployment: create: false enabled: false test: create: false console: affinity: {} annotations: {} automountServiceAccountToken: true autoscaling: enabled: false maxReplicas: 100 minReplicas: 1 targetCPUUtilizationPercentage: 80 commonLabels: {} config: {} configmap: create: false console: config: {} deployment: create: false enabled: true enterprise: licenseSecretRef: key: "" name: "" extraContainers: [] extraEnv: [] extraEnvFrom: [] extraVolumeMounts: [] extraVolumes: [] fullnameOverride: "" global: {} image: pullPolicy: IfNotPresent registry: docker.redpanda.com repository: redpandadata/console tag: "" imagePullSecrets: [] ingress: annotations: {} className: null enabled: false hosts: - host: chart-example.local paths: - path: / pathType: ImplementationSpecific tls: [] initContainers: extraInitContainers: "" livenessProbe: failureThreshold: 3 periodSeconds: 10 successThreshold: 1 timeoutSeconds: 1 nameOverride: "" nodeSelector: {} podAnnotations: {} podLabels: {} podSecurityContext: fsGroup: 99 runAsUser: 99 priorityClassName: "" readinessProbe: failureThreshold: 3 initialDelaySeconds: 10 periodSeconds: 10 successThreshold: 1 timeoutSeconds: 1 replicaCount: 1 resources: {} secret: create: false enterprise: {} kafka: {} login: github: {} google: {} jwtSecret: "" oidc: {} okta: {} redpanda: adminApi: {} secretMounts: [] securityContext: runAsNonRoot: true service: annotations: {} port: 8080 targetPort: null type: ClusterIP serviceAccount: annotations: {} automountServiceAccountToken: true create: true name: "" strategy: {} tests: enabled: true tolerations: [] topologySpreadConstraints: [] enterprise: license: "" licenseSecretRef: {} external: addresses: - localhost enabled: true service: enabled: true type: NodePort fullnameOverride: "" image: pullPolicy: IfNotPresent repository: docker.redpanda.com/redpandadata/redpanda tag: "" imagePullSecrets: [] license_key: "" license_secret_ref: {} listeners: admin: external: default: advertisedPorts: - 31644 enabled: false port: 9645 tls: cert: external port: 9644 tls: cert: default requireClientAuth: false http: authenticationMethod: null enabled: true external: default: advertisedPorts: - 30912 authenticationMethod: null port: 8083 tls: cert: external requireClientAuth: false kafkaEndpoint: default port: 8082 tls: cert: default requireClientAuth: false kafka: authenticationMethod: null external: default: advertisedPorts: - 30910 authenticationMethod: null port: 9094 tls: cert: external port: 9093 tls: cert: default requireClientAuth: false rpc: port: 33145 tls: cert: default requireClientAuth: false schemaRegistry: authenticationMethod: null enabled: false external: default: advertisedPorts: - 30081 authenticationMethod: null port: 8084 tls: cert: external requireClientAuth: false 
kafkaEndpoint: default port: 8081 tls: cert: default requireClientAuth: false logging: logLevel: info usageStats: enabled: true monitoring: enabled: false labels: {} scrapeInterval: 30s nameOverride: "" nodeSelector: {} post_install_job: affinity: {} enabled: true podTemplate: annotations: {} labels: {} spec: containers: - env: [] name: post-install securityContext: {} securityContext: {} rackAwareness: enabled: false nodeAnnotation: topology.kubernetes.io/zone rbac: annotations: {} enabled: false resources: cpu: cores: 1 memory: container: max: 2.5Gi serviceAccount: annotations: {} create: false name: "" statefulset: additionalRedpandaCmdFlags: [] additionalSelectorLabels: {} annotations: {} budget: maxUnavailable: 1 extraVolumeMounts: "" extraVolumes: "" initContainerImage: repository: busybox tag: latest initContainers: configurator: extraVolumeMounts: "" resources: {} extraInitContainers: "" fsValidator: enabled: false expectedFS: xfs extraVolumeMounts: "" resources: {} setDataDirOwnership: enabled: false extraVolumeMounts: "" resources: {} setTieredStorageCacheDirOwnership: extraVolumeMounts: "" resources: {} tuning: extraVolumeMounts: "" resources: {} livenessProbe: failureThreshold: 3 initialDelaySeconds: 10 periodSeconds: 10 nodeSelector: {} podAffinity: {} podAntiAffinity: custom: {} topologyKey: kubernetes.io/hostname type: hard weight: 100 podTemplate: annotations: {} labels: {} spec: containers: - env: [] name: redpanda securityContext: {} securityContext: {} priorityClassName: "" readinessProbe: failureThreshold: 3 initialDelaySeconds: 1 periodSeconds: 10 successThreshold: 1 replicas: 1 securityContext: fsGroup: 101 fsGroupChangePolicy: OnRootMismatch runAsUser: 101 sideCars: configWatcher: enabled: true extraVolumeMounts: "" resources: {} securityContext: {} controllers: createRBAC: true enabled: false healthProbeAddress: :8085 image: repository: docker.redpanda.com/redpandadata/redpanda-operator tag: v2.2.4-24.2.5 metricsAddress: :9082 resources: {} run: - all securityContext: {} startupProbe: failureThreshold: 120 initialDelaySeconds: 1 periodSeconds: 10 terminationGracePeriodSeconds: 90 tolerations: [] topologySpreadConstraints: - maxSkew: 1 topologyKey: topology.kubernetes.io/zone whenUnsatisfiable: ScheduleAnyway updateStrategy: type: RollingUpdate storage: hostPath: /dev/kafka persistentVolume: annotations: {} enabled: false labels: {} nameOverwrite: "" size: 20Gi storageClass: "" tiered: config: cloud_storage_cache_size: 5368709120 cloud_storage_enable_remote_read: true cloud_storage_enable_remote_write: true cloud_storage_enabled: false credentialsSecretRef: accessKey: configurationKey: cloud_storage_access_key secretKey: configurationKey: cloud_storage_secret_key hostPath: "" mountType: emptyDir persistentVolume: annotations: {} labels: {} storageClass: "" tests: enabled: true tls: certs: default: caEnabled: true external: caEnabled: true enabled: false tolerations: [] tuning: tune_aio_events: true ```

Anything else we need to know?

No response

Which are the affected charts?

Redpanda

Chart Version(s)

```console
$ helm -n list
redpanda-ba4e1924  test  1  2024-10-15 13:58:45.893116 +1000 AEST  failed  redpanda-5.9.7  v24.2.7
```

Cloud provider

OrbStack on a Mac Pro M2

JIRA Link: K8S-388

mattfysh commented 2 days ago

Currently working around this by using a Pulumi chart transform to remove the `bootstrap-yaml-envsubst` init container before deployment.
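Roughly, the transform just filters that container out of the rendered StatefulSet. A minimal sketch of the idea, assuming a TypeScript program using `@pulumi/kubernetes`'s `helm.v3.Chart` and its `transformations` option (chart repo and values are illustrative; the actual code may differ):

```typescript
import * as k8s from "@pulumi/kubernetes";

// Render the Redpanda chart, but strip the bootstrap-yaml-envsubst init
// container from the StatefulSet before the manifests are applied.
const redpanda = new k8s.helm.v3.Chart("redpanda", {
    chart: "redpanda",
    fetchOpts: { repo: "https://charts.redpanda.com" },
    values: { /* values.yaml from this issue */ },
    transformations: [
        (obj: any) => {
            const spec = obj?.spec?.template?.spec;
            if (obj.kind === "StatefulSet" && spec?.initContainers) {
                spec.initContainers = spec.initContainers.filter(
                    (c: any) => c.name !== "bootstrap-yaml-envsubst",
                );
            }
        },
    ],
});
```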

It strikes me as rather odd that this recently introduced container produces no logs, and that you cannot exec into it to debug: you get an OCI error because bash/zsh/dash/sh are all missing from PATH. This container probably needs to be battle-tested a little more if the plan is to keep it in the Helm chart.

chrisseto commented 1 day ago

The container in question uses distroless as a base, so the inability to exec is expected. The lack of logs is also expected, because the container only reads and then writes a file (it's effectively just envsubst).

We've not seen any issues with hanging thus far. Could you share any relevant output from `kubectl get events` and the output of `kubectl get pod <any-redpanda-pod> -o yaml`? Is your OrbStack configured to emulate x86, or is it running ARM containers?

chrisseto commented 1 day ago

This should only be affecting >= [redpanda-5.9.6](https://github.com/redpanda-data/helm-charts/releases/tag/redpanda-5.9.6).

If possible, would you mind trying out 5.9.5 to see if anything similarly strange happens?

chrisseto commented 1 day ago

Ah, `kubectl describe pod <release-name>-0` would actually be the most helpful. Deploying your values with a few modifications to the secret names seems to start up just fine for me, though it certainly looks like the Pod is stuck in init while kubelet complains about a missing Secret or ConfigMap.
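(If it helps, events scoped to just that Pod can be pulled with something like:)

```console
$ kubectl get events --field-selector involvedObject.name=<release-name>-0 --sort-by=.lastTimestamp
```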

mattfysh commented 1 day ago

Hi Chris, thanks for the reply. I've been trying all morning to find as much info as I can. Here's what I have so far.

Here is a log of pod statuses from a successful deploy via the helm CLI. There was only one time when the helm CLI did not produce the configuration job/container, which caused the main redpanda pod to get stuck at "Init:2/3"... however, even in this log of a successful rollout, isn't it strange that the 3rd init container takes more than 10 seconds? Does this bootstrap-yaml-envsubst container have any external network dependencies, or anything similar?

I've only been able to successfully roll out twice with pulumi-kubernetes. I'm not sure why the success rate for Pulumi is lower; I suspect they may be using an older version of helm internally?

```
redpanda-console-866dbb889c-x9fcf   0/1     Running    0             1s
redpanda-configuration-4lx44        0/1     Init:0/1   0             1s
redpanda-0                          0/2     Init:1/3   0             1s

redpanda-console-866dbb889c-x9fcf   0/1     Running    0             1s
redpanda-configuration-4lx44        0/1     Init:0/1   0             1s
redpanda-0                          0/2     Init:2/3   0             1s

redpanda-console-866dbb889c-x9fcf   0/1     Running           0             13s
redpanda-configuration-4lx44        0/1     Init:0/1          0             13s
redpanda-0                          0/2     PodInitializing   0             13s

redpanda-console-866dbb889c-x9fcf   0/1     Running           0             14s
redpanda-configuration-4lx44        0/1     PodInitializing   0             14s
redpanda-0                          0/2     PodInitializing   0             14s

redpanda-console-866dbb889c-x9fcf   0/1     Running           0             15s
redpanda-configuration-4lx44        1/1     Running           0             15s
redpanda-0                          0/2     PodInitializing   0             15s

redpanda-console-866dbb889c-x9fcf   0/1     Running     0             16s
redpanda-configuration-4lx44        0/1     Completed   0             16s
redpanda-0                          1/2     Running     0             16s
```

Here is the values.yaml I'm using:

```yaml
statefulset:
  replicas: 1
tls:
  enabled: false
listeners:
  kafka:
    external:
      default:
        advertisedPorts:
        - 31093
  admin:
    enabled: false
    external:
      default:
        enabled: false
  schemaRegistry:
    enabled: false
    external:
      default:
        enabled: false
  http:
    enabled: false
    external:
      default:
        enabled: false
```

I ran the commands you provided above and did a diff between a successful rollout and a failed one, but there were no significant differences.

Not sure where to go from here. I'm a little confused as to which tool is causing the error, since I've seen both successful and failed rollouts from both the helm CLI and the Pulumi CLI.

mattfysh commented 1 day ago

Code available here:

https://github.com/mattfysh/redpanda-deploy-bug

mattfysh commented 1 day ago

I've opened a bug with pulumi-kubernetes: https://github.com/pulumi/pulumi-kubernetes/issues/3265 but I'm still not convinced that's where the root cause lies.

Is it possible to get some logging from the bootstrap-yaml-envsubst container? OrbStack claims the container is running... so it must be doing something? Could it log every step along the way, so we can see where its init process is hanging?
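For reference, checking for init-container output with the usual kubectl commands turns up nothing:

```console
$ kubectl logs <release-name>-0 -c bootstrap-yaml-envsubst
# and, if the container had restarted:
$ kubectl logs <release-name>-0 -c bootstrap-yaml-envsubst --previous
```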

mattfysh commented 1 day ago

The deployment failure is not occurring 100% of the time, which suggests this is a toolchain issue rather than an issue with the deployment values.

chrisseto commented 17 hours ago

Thanks for continuing to dig into this!

Here's the source of the envsubst container: https://github.com/redpanda-data/redpanda-operator/blob/main/operator/cmd/envsubst/envsubst.go. Disk IO is really the only thing it's doing that could be hanging.

Could you run `kubectl describe pod redpanda-0` on one of the failing deployments and share the output? I'd like to see the full container status and, more importantly, any events associated with the Pod.

chrisseto commented 17 hours ago

> the bug seems to occur when the helm chart does not produce the configuration job

This is likely a red herring. The configuration job will only run after the StatefulSet becomes Ready which is prevented if the init container hangs.

mattfysh commented 15 hours ago

```
Name:             redpanda-ceda518c-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             orbstack/198.19.249.2
Start Time:       Thu, 17 Oct 2024 03:02:11 +1000
Labels:           app.kubernetes.io/component=redpanda-statefulset
                  app.kubernetes.io/instance=redpanda-ceda518c
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=redpanda
                  apps.kubernetes.io/pod-index=0
                  controller-revision-hash=redpanda-ceda518c-74f6484f4b
                  helm.sh/chart=redpanda-5.9.7
                  redpanda.com/poddisruptionbudget=redpanda-ceda518c
                  statefulset.kubernetes.io/pod-name=redpanda-ceda518c-0
Annotations:      config.redpanda.com/checksum: f05f9a3c004ec98ce95cc606dac824014cc5bd64cb1c44fe8f6968b659c6d979
Status:           Pending
IP:               192.168.194.42
IPs:
  IP:           192.168.194.42
  IP:           fd07:b51a:cc66:a::2a
Controlled By:  StatefulSet/redpanda-ceda518c
Init Containers:
  tuning:
    Container ID:  docker://e1c9a6bbcfcc27f6f363817b5f8fecaa407469578ea8f45f90d8aad6e277d77a
    Image:         docker.redpanda.com/redpandadata/redpanda:v24.2.7
    Image ID:      docker-pullable://docker.redpanda.com/redpandadata/redpanda@sha256:82a69763bef8d8b55ea5a520fa1b38f993908ef68946819ca1aed43541824c48
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      rpk redpanda tune all
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 17 Oct 2024 03:02:12 +1000
      Finished:     Thu, 17 Oct 2024 03:02:12 +1000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/redpanda from base-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
  redpanda-configurator:
    Container ID:  docker://1de50536c1b7eac009558a7bc1fd0c0b61b2bbf0ce6eecf44e01c34e4f0ee612
    Image:         docker.redpanda.com/redpandadata/redpanda:v24.2.7
    Image ID:      docker-pullable://docker.redpanda.com/redpandadata/redpanda@sha256:82a69763bef8d8b55ea5a520fa1b38f993908ef68946819ca1aed43541824c48
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      trap "exit 0" TERM; exec $CONFIGURATOR_SCRIPT "${SERVICE_NAME}" "${KUBERNETES_NODE_NAME}" & wait $!
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 17 Oct 2024 03:02:13 +1000
      Finished:     Thu, 17 Oct 2024 03:02:13 +1000
    Ready:          True
    Restart Count:  0
    Environment:
      CONFIGURATOR_SCRIPT:   /etc/secrets/configurator/scripts/configurator.sh
      SERVICE_NAME:          redpanda-ceda518c-0 (v1:metadata.name)
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      HOST_IP_ADDRESS:        (v1:status.hostIP)
    Mounts:
      /etc/redpanda from config (rw)
      /etc/secrets/configurator/scripts/ from redpanda-ceda518c-configurator (rw)
      /tmp/base-config from base-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
  bootstrap-yaml-envsubst:
    Container ID:  docker://37eb987a144d998148954cb49183953b5c5f1d70197736abeb19f3dd71eba8e4
    Image:         docker.redpanda.com/redpandadata/redpanda-operator:v2.2.4-24.2.5
    Image ID:      docker-pullable://docker.redpanda.com/redpandadata/redpanda-operator@sha256:17979d5443f420a1791edb067149d841bb8251c534e1c289a8fbc11392a7aca2
    Port:          <none>
    Host Port:     <none>
    Command:
      /redpanda-operator
      envsubst
      /tmp/base-config/bootstrap.yaml
      --output
      /tmp/config/.bootstrap.yaml
    State:          Running
      Started:      Thu, 17 Oct 2024 03:02:14 +1000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:        100m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /tmp/base-config/ from base-config (rw)
      /tmp/config/ from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
Containers:
  redpanda:
    Container ID:  
    Image:         docker.redpanda.com/redpandadata/redpanda:v24.2.7
    Image ID:      
    Ports:         9644/TCP, 8082/TCP, 9093/TCP, 9094/TCP, 33145/TCP, 8081/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      rpk
      redpanda
      start
      --advertise-rpc-addr=$(SERVICE_NAME).redpanda-ceda518c.default.svc.cluster.local.:33145
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2560Mi
    Requests:
      cpu:      1
      memory:   2560Mi
    Liveness:   exec [/bin/sh -c curl --silent --fail -k -m 5  "http://${SERVICE_NAME}.redpanda-ceda518c.default.svc.cluster.local.:9644/v1/status/ready"] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [/bin/sh -c set -x
RESULT=$(rpk cluster health)
echo $RESULT
echo $RESULT | grep 'Healthy:.*true'
] delay=1s timeout=1s period=10s #success=1 #failure=3
    Startup:  exec [/bin/sh -c set -e
RESULT=$(curl --silent --fail -k -m 5  "http://${SERVICE_NAME}.redpanda-ceda518c.default.svc.cluster.local.:9644/v1/status/ready")
echo $RESULT
echo $RESULT | grep ready
] delay=1s timeout=1s period=10s #success=1 #failure=120
    Environment:
      SERVICE_NAME:  redpanda-ceda518c-0 (v1:metadata.name)
      POD_IP:         (v1:status.podIP)
      HOST_IP:        (v1:status.hostIP)
    Mounts:
      /etc/redpanda from config (rw)
      /tmp/base-config from base-config (rw)
      /var/lib/redpanda/data from datadir (rw)
      /var/lifecycle from lifecycle-scripts (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
  config-watcher:
    Container ID:  
    Image:         docker.redpanda.com/redpandadata/redpanda:v24.2.7
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      trap "exit 0" TERM; exec /etc/secrets/config-watcher/scripts/sasl-user.sh & wait $!
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/redpanda from config (rw)
      /etc/secrets/config-watcher/scripts from redpanda-ceda518c-config-watcher (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-redpanda-ceda518c-0
    ReadOnly:   false
  lifecycle-scripts:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  redpanda-ceda518c-sts-lifecycle
    Optional:    false
  base-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      redpanda-ceda518c
    Optional:  false
  config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  redpanda-ceda518c-configurator:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  redpanda-ceda518c-configurator
    Optional:    false
  redpanda-ceda518c-config-watcher:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  redpanda-ceda518c-config-watcher
    Optional:    false
  kube-api-access-dtgbc:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/component=redpanda-statefulset,app.kubernetes.io/instance=redpanda-ceda518c,app.kubernetes.io/name=redpanda
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  71s   default-scheduler  Successfully assigned default/redpanda-ceda518c-0 to orbstack
  Normal  Pulled     71s   kubelet            Container image "docker.redpanda.com/redpandadata/redpanda:v24.2.7" already present on machine
  Normal  Created    71s   kubelet            Created container tuning
  Normal  Started    71s   kubelet            Started container tuning
  Normal  Pulled     70s   kubelet            Container image "docker.redpanda.com/redpandadata/redpanda:v24.2.7" already present on machine
  Normal  Created    70s   kubelet            Created container redpanda-configurator
  Normal  Started    70s   kubelet            Started container redpanda-configurator
  Normal  Pulled     69s   kubelet            Container image "docker.redpanda.com/redpandadata/redpanda-operator:v2.2.4-24.2.5" already present on machine
  Normal  Created    69s   kubelet            Created container bootstrap-yaml-envsubst
  Normal  Started    69s   kubelet            Started container bootstrap-yaml-envsubst
```

mattfysh commented 15 hours ago

I just noticed these very small limits on the init container that is stuck:

```
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:        100m
      memory:     25Mi
```

coming from

https://github.com/redpanda-data/helm-charts/blob/56e8cc7a5e8a34ed1a27c8073490e76493ecda86/charts/redpanda/post_install_upgrade_job.go#L55-L64

Given the flaky behaviour, I'm leaning towards this being an issue where the limit is reached, and perhaps an OrbStack issue where it does not surface any event or failure when that happens?
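Two ways I can think of to sanity-check the limit theory (the `top` command assumes metrics-server is installed, and I'm not certain it reports init containers while the pod is still initializing):

```console
# last state of the stuck init container; would show OOMKilled if the 25Mi limit was hit
$ kubectl get pod redpanda-ceda518c-0 \
    -o jsonpath='{.status.initContainerStatuses[?(@.name=="bootstrap-yaml-envsubst")].lastState}'

# live per-container usage, if metrics-server is available
$ kubectl top pod redpanda-ceda518c-0 --containers
```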