traefik install health check requires ALL pods in cnpg-system, cert-manager, or metallb-system namespaces to be in a running state

troyhebe commented 1 year ago

App Name

traefik

SCALE Version

22.12.0

App Version

2.9.8_17.0.12

Application Events

N/A

Application Logs

N/A

Application Configuration

N/A

Describe the bug

The traefik install has a health check that requires ALL pods in the cnpg-system, cert-manager, or metallb-system namespaces to be in a running state. The install runs a health check shell script containing 3 kubectl wait commands, which can be seen here:

k3s kubectl describe pod/traefik-manifests-hgtvp -n ix-traefik

Containers:
  traefik-manifests:
    Container ID:  docker://9d3a9d48469f1260388192b24a41cbff9eabfab6b215543d2df5e1b09864a03f
    Image:         tccr.io/truecharts/kubectl:v1.26.0@sha256:06902a576090e5bfae3fd4e9eccc60bfe614adf00fa50ab233772a66062558a7
    Image ID:      docker-pullable://tccr.io/truecharts/kubectl@sha256:06902a576090e5bfae3fd4e9eccc60bfe614adf00fa50ab233772a66062558a7
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      /bin/sh <<'EOF'
      touch /tmp/healthy
      echo "installing manifests..."
      kubectl apply --server-side --force-conflicts --grace-period 30 --v=4 -k https://github.com/truecharts/manifests/manifests || kubectl apply --server-side --force-conflicts --grace-period 30 -k https://github.com/truecharts/manifests/manifests || echo "job failed..."
      echo "Install finished..."
      echo "Starting waits and checks..."
      kubectl wait --namespace cnpg-system --for=condition=ready pod --selector=app.kubernetes.io/name=cloudnative-pg --timeout=90s || echo "metallb-system wait failed..."
      kubectl wait --namespace metallb-system --for=condition=ready pod --selector=app=metallb --timeout=90s || echo "metallb-system wait failed..."
      kubectl wait --namespace cert-manager --for=condition=ready pod --selector=app.kubernetes.io/instance=cert-manager --timeout=90s || echo "cert-manager wait failed..."
      cmctl check api --wait=2m || echo "cmctl wait failed..."
      EOF

The bug I encountered is that "kubectl wait" blocks and eventually times out, causing this check to fail, if any of the pods it is waiting on in that namespace are in the Completed state.

In my specific case, a power outage caused my TrueNAS to shut down. After the reboot traefik would not start, so I decided to simply redeploy. The redeploy would ALWAYS fail because cnpg-system and cert-manager had pods in the "Completed" state:

cert-manager          pod/cert-manager-cainjector-ffb4747bb-v8tkh    0/1     Completed   0          26d
cert-manager          pod/cert-manager-webhook-545bd5d7d8-xmjbw      0/1     Completed   0          26d
prometheus-operator   pod/prometheus-operator-5dcffb7cb8-6rl84       0/1     Completed   0          26d
cnpg-system           pod/cnpg-controller-manager-5d74bc79fb-sh6tp   0/1     Completed   1          14d
cert-manager          pod/cert-manager-8444f6f86b-2hmhm              0/1     Completed   0          3d11h
kube-system           pod/openebs-zfs-node-lp7m7                     2/2     Running     0          10h
cert-manager          pod/cert-manager-cainjector-ffb4747bb-qmj78    1/1     Running     0          10h
metallb-system        pod/speaker-vgc22                              1/1     Running     0          10h
prometheus-operator   pod/prometheus-operator-5dcffb7cb8-xs7nm       1/1     Running     0          10h
cert-manager          pod/cert-manager-8444f6f86b-54rbl              1/1     Running     0          10h
cert-manager          pod/cert-manager-webhook-545bd5d7d8-vszgm      1/1     Running     0          10h
kube-system           pod/coredns-75fc8f8fff-plwnj                   1/1     Running     0          10h
kube-system           pod/openebs-zfs-controller-0                   5/5     Running     0          10h
metallb-system        pod/controller-84d6d4db45-l4ggm                1/1     Running     0          10h
cnpg-system           pod/cnpg-controller-manager-5d74bc79fb-6lpp7   1/1     Running     0          10h
ix-joplin             pod/joplin-postgresql-0                        1/1     Running     0          10h
ix-joplin             pod/joplin-joplin-server-5545669c67-87qxs      1/1     Running     0          10h

Even though there are cnpg-system pods in the Running state, this wait will always block on the Completed pod:

kubectl wait --namespace cnpg-system --for=condition=ready pod --selector=app.kubernetes.io/name=cloudnative-pg --timeout=90s || echo "metallb-system wait failed..."

The logical goal of this command seems to be to ensure that A SINGLE cnpg-system pod is ready. However, what it really does is check that EVERY matching cnpg-system pod is in a ready state, and that seems to be a bug.
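If the intent really is "at least one healthy replica", one possible alternative is to wait on the Deployment's Available condition instead of on every pod matching the selector. The following is only a sketch under assumptions: the helper name wait_deployment_available and the KUBECTL override are hypothetical, and the Deployment name cnpg-controller-manager is inferred from the pod names in the listing above, not confirmed by the manifests.

```shell
# Hypothetical helper: wait for a Deployment to report Available instead of
# waiting for every pod that matches a label selector.
# KUBECTL may be overridden, e.g. KUBECTL="k3s kubectl" (assumption).
wait_deployment_available() {
  ns="$1"              # namespace, e.g. cnpg-system
  deploy="$2"          # Deployment name (inferred from pod names, unverified)
  timeout="${3:-90s}"  # same default timeout as the original script
  if ! ${KUBECTL:-kubectl} wait --namespace "$ns" \
      --for=condition=available "deployment/$deploy" \
      --timeout="$timeout"; then
    echo "$ns wait failed..."
    return 1
  fi
}
```

For example, `wait_deployment_available cnpg-system cnpg-controller-manager` would replace the first wait. Whether a leftover Completed pod can still affect the Deployment's Available condition has not been tested here.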

To Reproduce

Get a single pod in the cnpg-system, cert-manager, or metallb-system namespace into a state OTHER than Running
Try to deploy traefik
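To check whether a system is already in this state, the pod listing can be filtered for the three namespaces. This is a rough sketch: the function name non_running_pods is hypothetical, and it assumes input rows in the NAMESPACE NAME READY STATUS RESTARTS AGE layout shown in the listing above.

```shell
# Hypothetical filter: reads pod rows on stdin in the layout
#   NAMESPACE  NAME  READY  STATUS  RESTARTS  AGE
# and prints namespace, name and status for every pod in the three
# namespaces the health check waits on whose STATUS is not "Running".
non_running_pods() {
  awk '
    ($1 == "cnpg-system" || $1 == "cert-manager" || $1 == "metallb-system") &&
    $4 != "Running" { print $1, $2, $4 }
  '
}
```

One way to use it would be `k3s kubectl get pods -A --no-headers | non_running_pods`; empty output means none of the pods in those namespaces is outside the Running state.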

Expected Behavior

N/A

Screenshots

N/A

Additional Context

N/A

I've read and agree with the following

[X] I've checked all open and closed issues and my issue is not there.

PrivatePuffin commented 1 year ago

All these pods should be in a running state. If they are not, you have bigger issues than just traefik.

As far as I'm aware, "Completed" is not a valid status for any of those pods, and the pods should be deletable without issues. The safeguard that prevents users from starting TrueCharts charts on an unhealthy system seems to be working correctly here.

troyhebe commented 1 year ago

May I suggest that the error messages on the wait commands be modified to tell users to look for and delete any pods in the aforementioned namespaces that are not in the Running state, rather than just printing "metallb-system wait failed...", etc.?

This issue is not easy to see right away and at least that way the logs will give users a touch more information to work with.
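As a rough illustration of what I mean, each wait could be wrapped so that a failure prints a hint along with the namespace. The helper name wait_ready_with_hint, the KUBECTL override, and the hint wording are my own assumptions, not part of the existing script.

```shell
# Hypothetical wrapper around the existing waits: same kubectl wait call,
# but a failure also tells the user what to look for.
# KUBECTL may be overridden, e.g. KUBECTL="k3s kubectl" (assumption).
wait_ready_with_hint() {
  ns="$1"              # namespace to check
  selector="$2"        # label selector used by the original wait
  timeout="${3:-90s}"  # same default timeout as the original script
  if ${KUBECTL:-kubectl} wait --namespace "$ns" --for=condition=ready pod \
      --selector="$selector" --timeout="$timeout"; then
    return 0
  fi
  echo "$ns wait failed..."
  echo "Hint: look for pods in $ns that are not in the Running state (for example Completed) and delete them."
  return 1
}
```

The first wait would then become `wait_ready_with_hint cnpg-system app.kubernetes.io/name=cloudnative-pg`.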

PrivatePuffin commented 1 year ago

AFAIK SCALE doesn't even show init container logs…

truecharts-admin commented 1 year ago

This issue is locked to prevent necro-posting on closed issues. Please create a new issue or contact staff on Discord if the problem persists.
