threefoldtech / tf-images

k3s: helm-install-traefik pod falls into CrashLoopBackOff state #121

Closed sameh-farouk closed 1 year ago

sameh-farouk commented 1 year ago

The original issue happens on mainnet and testnet: https://github.com/threefoldtech/tf_support/issues/412#issuecomment-1369799997

I was able to reproduce it on Devnet as well.

I will share the debugging session's findings ASAP. We will need to update the k3s image.

sameh-farouk commented 1 year ago
root@MR113e9c8e:~# kubectl get pods -A

NAMESPACE     NAME                                      READY   STATUS             RESTARTS       AGE
kube-system   local-path-provisioner-84bb864455-dbnmk   1/1     Running            0              28m
kube-system   coredns-96cc4f57d-ds7mv                   1/1     Running            0              28m
kube-system   metrics-server-ff9dbcb6c-svqzb            1/1     Running            0              28m
kube-system   helm-install-traefik--1-fq57z             0/1     CrashLoopBackOff   10 (19s ago)   28m

The helm-install-traefik pod keeps restarting. The reason is the error at the end of the pod logs:

root@MR113e9c8e:~# kubectl logs -n kube-system helm-install-traefik--1-fq57z
CHART="${CHART//%\{KUBERNETES_API\}%/${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}}"
set +v -x
+ [[ '' != \t\r\u\e ]]
+ + export tiller HELM_HOST=127.0.0.1:44134--listen=127.0.0.1:44134 
--storage=secret
+ HELM_HOST=127.0.0.1:44134
+ helm_v2 init --skip-refresh --client-only --stable-repo-url https://charts.helm.sh/stable/
[main] 2023/01/04 16:07:49 Starting Tiller v2.17.0 (tls=false)
[main] 2023/01/04 16:07:49 GRPC listening on 127.0.0.1:44134
[main] 2023/01/04 16:07:49 Probes listening on :44135
[main] 2023/01/04 16:07:49 Storage driver is Secret
[main] 2023/01/04 16:07:49 Max history per release is 0
Creating /home/klipper-helm/.helm 
Creating /home/klipper-helm/.helm/repository 
Creating /home/klipper-helm/.helm/repository/cache 
Creating /home/klipper-helm/.helm/repository/local 
Creating /home/klipper-helm/.helm/plugins 
Creating /home/klipper-helm/.helm/starters 
Creating /home/klipper-helm/.helm/cache/archive 
Creating /home/klipper-helm/.helm/repository/repositories.yaml 
Adding stable repo with URL: https://charts.helm.sh/stable/ 
Adding local repo with URL: http://127.0.0.1:8879/charts 
$HELM_HOME has been configured at /home/klipper-helm/.helm.
Not installing Tiller due to 'client-only' flag having been set
++ jq -r '.Releases | length'
++ timeout -s KILL 30 helm_v2 ls --all '^traefik$' --output json
[storage] 2023/01/04 16:07:50 listing all releases with filter
+ V2_CHART_EXISTS=
+ [[ '' == \1 ]]
+ [[ '' == \v\2 ]]
+ [[ -n '' ]]
+ shopt -s nullglob
+ helm_content_decode
+ set -e
+ ENC_CHART_PATH=/chart/traefik.tgz.base64
+ CHART_PATH=/tmp/traefik.tgz
+ [[ ! -f /chart/traefik.tgz.base64 ]]
+ return
+ [[ install != \d\e\l\e\t\e ]]
+ helm_repo_init
+ grep -q -e 'https\?://'
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
+ [[ traefik == stable/* ]]
+ [[ -n https://helm.traefik.io/traefik ]]
+ helm_v3 repo add traefik https://helm.traefik.io/traefik
"traefik" has been added to your repositories
+ helm_v3 repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "traefik" chart repository
Update Complete. ⎈Happy Helming!⎈
+ helm_update install --repo https://helm.traefik.io/traefik
+ [[ helm_v3 == \h\e\l\m\_\v\3 ]]
++ helm_v3 ls ++ tr '[:upper:]' '[:lower:]'
--all -f '^traefik$' --namespace kube-system --output json
++ jq -r '"\(.[0].app_version),\(.[0].status)"'
+ LINE=null,null
+ IFS=,
+ read -r INSTALLED_VERSION STATUS _
+ VALUES=
+ for VALUES_FILE in /config/*.yaml
+ VALUES=' --values /config/values-01_HelmChart.yaml'
+ [[ install = \d\e\l\e\t\e ]]
+ [[ null =~ ^(|null)$ ]]
+ [[ null =~ ^(|null)$ ]]
+ helm_v3 install --repo https://helm.traefik.io/traefik traefik traefik --values /config/values-01_HelmChart.yaml
Error: execution error at (traefik/templates/deployment.yaml:3:8): ERROR: Helm >= 3.9.0 is required
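
The final error comes from a version gate in the chart: it refuses to render unless the Helm client is at least 3.9.0. As a rough illustration of that kind of check, here is a minimal sketch. `version_ge` is a hypothetical helper (not part of klipper-helm or the Traefik chart), the `bundled` value is a placeholder because the log does not show the client version, and `sort -V` is assumed to be available (GNU coreutils or BusyBox).

```sh
# Sketch only: version_ge is a hypothetical helper, not part of klipper-helm
# or the Traefik chart. It relies on `sort -V` (GNU coreutils / BusyBox).

# version_ge A B: succeed when dotted version A is greater than or equal to B.
version_ge() {
    # Sort the two versions; if B comes out first (or equals A), then A >= B.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="3.9.0"   # taken from the chart error in the log above
bundled="3.8.0"    # ASSUMPTION: placeholder; substitute the client's real version

if version_ge "$bundled" "$required"; then
    echo "Helm $bundled satisfies >= $required"
else
    echo "Helm $bundled is older than $required: the chart refuses to render"
fi
```

Running the same comparison against the version reported by the Helm client inside the job image would show whether the bundled client is what trips the chart's check.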

I know Traefik is integrated into k3s by default, so I wanted to see how we deploy k3s. I checked the flist Dockerfile and the zinit services and found the following:

  1. We start the k3s server with --no-deploy traefik to prevent it from deploying the packaged (embedded) Traefik component. This option, by the way, is marked as deprecated.
  2. We use the Automatically Deployed Manifests feature: we provide a custom manifest, traefik.yaml, and copy it to /var/lib/rancher/k3s/server/manifests. It is installed at runtime by rancher/helm-controller.
  3. We are running k3s v1.22.7, and the install of our custom manifest fails. I compared it with the default one shipped with k3s and found some differences that I believe are the main cause of the issue: the version (we are installing 2.7.0, while this version of k3s comes with Traefik 2.6.1) and the source repo (we use the Traefik repo, while the packaged one comes from Rancher and could be tailored for k3s). So my guess is that the version, or the chart itself, is not compatible with k3s (or at least with k3s 1.22.7).
  4. Reading through the docs and some external resources, I found that the proper way to configure Traefik on k3s is to provide a HelmChartConfig manifest instead of the HelmChart manifest we provide. We shouldn't pass --no-deploy traefik on server start, since we don't want a different ingress controller than Traefik; we just need to apply our config on top of the one embedded in k3s.
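
To make the distinction in point 4 concrete, here is a small sketch that reports whether a manifest replaces the packaged chart (`HelmChart`) or only layers values on top of it (`HelmChartConfig`). The function names `manifest_kind` and `check_traefik_manifest` are hypothetical helpers made up for this illustration, and the check simply reads the first top-level `kind:` line rather than parsing YAML properly.

```sh
# Sketch only: both functions are hypothetical helpers for illustration.

# manifest_kind FILE: print the value of the first top-level `kind:` key.
manifest_kind() {
    awk '/^kind:[[:space:]]*/ { sub(/^kind:[[:space:]]*/, ""); print; exit }' "$1"
}

# check_traefik_manifest FILE: say whether FILE overrides or configures Traefik.
check_traefik_manifest() {
    case "$(manifest_kind "$1")" in
        HelmChartConfig) echo "ok: $1 customizes the packaged Traefik chart" ;;
        HelmChart)       echo "warning: $1 replaces the packaged Traefik chart" ;;
        *)               echo "unknown: $1 has no recognized kind" ;;
    esac
}

# Check whatever Traefik manifests exist in the k3s auto-deploy directory.
for f in /var/lib/rancher/k3s/server/manifests/traefik*.yaml; do
    if [ -f "$f" ]; then
        check_traefik_manifest "$f"
    fi
done
```

On a node built from the old image, this would be expected to flag traefik.yaml as a `HelmChart`, which is the override described in point 2.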

I tried this fix on the fly to make sure that my findings are correct. Here are my steps:

if [ -z "${K3S_DATA_DIR}" ]; then
    K3S_DATA_DIR=""
else
    cp -r /var/lib/rancher/k3s/* $K3S_DATA_DIR
    K3S_DATA_DIR="--data-dir $K3S_DATA_DIR --kubelet-arg=root-dir=$K3S_DATA_DIR/kubelet"
fi

if [ -z "${K3S_FLANNEL_IFACE}" ]; then
    K3S_FLANNEL_IFACE=eth0
fi

if [ "$K3S_URL" = "" ]; then
    k3s server --flannel-iface $K3S_FLANNEL_IFACE $K3S_DATA_DIR >> /var/log/k3s-service.log 2>&1
else
    k3s agent --flannel-iface $K3S_FLANNEL_IFACE $K3S_DATA_DIR >> /var/log/k3s-service.log 2>&1
fi

- I moved `/var/lib/rancher/k3s/server/manifests/traefik.yaml` to `/tmp` and created `traefik-config.yaml` in its place:
```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system
spec:
  valuesContent: |-
    additionalArguments:
      - "--certificatesresolvers.default.acme.tlschallenge"
      - "--certificatesresolvers.default.acme.email=dsafsdajfksdhfkjadsfoo@you.com"
      - "--certificatesresolvers.default.acme.storage=/data/acme.json"
      - "--certificatesresolvers.default.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory"
      - "--certificatesresolvers.default.acme.httpchallenge.entrypoint=web"
      - "--certificatesresolvers.gridca.acme.tlschallenge"
      - "--certificatesresolvers.gridca.acme.email=dsafsdajfksdhfkjadsfoo@you.com"
      - "--certificatesresolvers.gridca.acme.storage=/data/acme1.json"
      - "--certificatesresolvers.gridca.acme.caserver=https://ca1.grid.tf"
      - "--certificatesresolvers.gridca.acme.httpchallenge.entrypoint=web"
      - "--certificatesresolvers.le.acme.tlschallenge"
      - "--certificatesresolvers.le.acme.email=dsafsdajfksdhfkjadsfoo@you.com"
      - "--certificatesresolvers.le.acme.storage=/data/acme2.json"
      - "--certificatesresolvers.le.acme.caserver=https://acme-v02.api.letsencrypt.org/directory"
      - "--certificatesresolvers.le.acme.httpchallenge.entrypoint=web"
    ports:
      web:
        redirectTo: websecure
      websecure:
        tls:
          enabled: true
```

then

kubectl -n kube-system delete helmcharts.helm.cattle.io traefik

After that, I wanted to see whether Traefik was installed successfully. The helm-install-traefik pod ran successfully; however, helm-install-traefik-crd exited with an error. I checked the logs of that pod, which showed this error:

Error: rendered manifests contain a resource that already exists. Unable to continue with install: CustomResourceDefinition "ingressroutes.traefik.containo.us" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "traefik-crd"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "kube-system"

These CRDs were left over from the previous failed install, so I deleted them:

kubectl get crds --no-headers=true | awk '/traefik/{print $1}'| xargs  kubectl delete crds
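
Deleting a CRD also removes any custom resources of that kind, which is harmless on a fresh cluster like this one but may not be elsewhere. As an alternative that was not used in this session, Helm 3.2 and later can adopt an existing resource once it carries the label and annotations named in the error message. The sketch below only prints the `kubectl label` and `kubectl annotate` commands so they can be reviewed first; `adopt_crd_cmds` is a hypothetical helper, and the release name and namespace are the ones quoted in the error.

```sh
# Sketch only: adopt_crd_cmds is a hypothetical helper. It prints commands
# and does not run them.

# adopt_crd_cmds RELEASE NAMESPACE: read CRD names on stdin (one per line) and
# print the kubectl commands that add the ownership metadata Helm asks for.
adopt_crd_cmds() {
    release=$1
    namespace=$2
    while IFS= read -r crd; do
        [ -n "$crd" ] || continue   # skip blank lines
        echo "kubectl label crd $crd app.kubernetes.io/managed-by=Helm --overwrite"
        echo "kubectl annotate crd $crd meta.helm.sh/release-name=$release meta.helm.sh/release-namespace=$namespace --overwrite"
    done
}

# Example (prints commands for review; append `| sh` to run them):
#   kubectl get crds --no-headers=true | awk '/traefik/{print $1}' \
#       | adopt_crd_cmds traefik-crd kube-system
```

Whether adoption or deletion is the better choice depends on whether the leftover CRDs match what the packaged chart expects; deletion guarantees a clean state, which is what was done here.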

Then I restarted the pod:

kubectl get pod/helm-install-traefik-crd--1-cx7bb -n kube-system -o yaml | kubectl replace --force -f -

After that, everything went fine:

root@MR113e9c8e:~# kubectl get all -A
NAMESPACE     NAME                                          READY   STATUS      RESTARTS   AGE
kube-system   pod/local-path-provisioner-84bb864455-dbnmk   1/1     Running     0          3h29m
kube-system   pod/coredns-96cc4f57d-ds7mv                   1/1     Running     0          3h29m
kube-system   pod/metrics-server-ff9dbcb6c-svqzb            1/1     Running     0          3h29m
kube-system   pod/svclb-traefik-hcprh                       2/2     Running     0          169m
kube-system   pod/helm-install-traefik--1-f29hk             0/1     Completed   0          169m
kube-system   pod/svclb-traefik-ms9p4                       2/2     Running     0          169m
kube-system   pod/traefik-f75f5998-zj4n8                    1/1     Running     0          169m
kube-system   pod/helm-install-traefik-crd--1-r4drp         0/1     Completed   0          91m

NAMESPACE     NAME                     TYPE           CLUSTER-IP     EXTERNAL-IP           PORT(S)                      AGE
default       service/kubernetes       ClusterIP      10.43.0.1      <none>                443/TCP                      3h30m
kube-system   service/kube-dns         ClusterIP      10.43.0.10     <none>                53/UDP,53/TCP,9153/TCP       3h30m
kube-system   service/metrics-server   ClusterIP      10.43.162.26   <none>                443/TCP                      3h30m
kube-system   service/traefik          LoadBalancer   10.43.235.87   10.20.2.2,10.20.2.3   80:31346/TCP,443:30163/TCP   169m

NAMESPACE     NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   daemonset.apps/svclb-traefik   2         2         2       2            2           <none>          169m

NAMESPACE     NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/local-path-provisioner   1/1     1            1           3h30m
kube-system   deployment.apps/coredns                  1/1     1            1           3h30m
kube-system   deployment.apps/metrics-server           1/1     1            1           3h30m
kube-system   deployment.apps/traefik                  1/1     1            1           169m

NAMESPACE     NAME                                                DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/local-path-provisioner-84bb864455   1         1         1       3h29m
kube-system   replicaset.apps/coredns-96cc4f57d                   1         1         1       3h29m
kube-system   replicaset.apps/metrics-server-ff9dbcb6c            1         1         1       3h29m
kube-system   replicaset.apps/traefik-f75f5998                    1         1         1       169m

NAMESPACE     NAME                                 COMPLETIONS   DURATION   AGE
kube-system   job.batch/helm-install-traefik       1/1           10s        169m
kube-system   job.batch/helm-install-traefik-crd   1/1           77m        169m
sameh-farouk commented 1 year ago

Should I just fix the current k3s image (1.22.7), or also upgrade the k3s version (to 1.26.0), @xmonader?

xmonader commented 1 year ago

Please do the upgrade as well. Thank you!

sameh-farouk commented 1 year ago

Update: PR ready for review https://github.com/threefoldtech/tf-images/pull/122

sameh-farouk commented 1 year ago

@maxux, please cross-link or promote this flist to tf-official-apps: samehabouelsaad.3bot/abouelsaad-k3s_1.26.0-latest.flist -> tf-official-apps/threefoldtech-k3s-latest.flist

sameh-farouk commented 1 year ago

Update: a new flist is now available with the latest Kubernetes release (1.26) and a few fixes and improvements. It includes a fix for this issue, as we now use the embedded Traefik component instead of overriding the prepackaged manifest. https://github.com/threefoldtech/tf-images/pull/122 The update will take effect as soon as the new flist is promoted to the official apps repo.