threefoldtech / js-sdk

extensions to js-ng for tfgrid
Apache License 2.0

helm reports the wrong Traefik app version as a side effect of how we upgrade it #3259

Closed sameh-farouk closed 3 years ago

sameh-farouk commented 3 years ago

Description

helm doesn't provide a straightforward way to install a chart based on a specific app version.

in the current implementation of upgrade_traefik, we use a constant that refers to a specific app version, so in practice we only update the image tag used for the Traefik container (overriding the value the chart would otherwise use to set the container version). this works to some degree, but it has drawbacks.

root@zosv2-04:~# helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml --namespace kube-system install --timeout 7m0s --create-namespace traefik traefik/traefik  --version 9.8.4 -f <(echo -e 'image:
  tag: 2.4.8
additionalArguments:
  ....)

the first drawback: this doesn't update the APP VERSION shown in the output of helm list.

root@zosv2-04:~# helm list  -A
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/.kube/config
NAME    NAMESPACE   REVISION    UPDATED                                 STATUS      CHART           APP VERSION
traefik kube-system 1           2021-07-26 13:09:41.189834672 +0000 UTC deployed    traefik-10.1.1  2.4.9

to find what is really deployed and running, we can describe the Traefik deployment:

root@zosv2-04:~# kubectl describe deployment traefik -n kube-system   
Name:                   traefik
Namespace:              kube-system
...
Pod Template:
  ...
  Containers:
   traefik:
    Image:       traefik:2.4.8
...
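
as a programmatic equivalent of reading the describe output, here is a minimal Python sketch that pulls the image tag out of the deployment JSON. it assumes `kubectl get deployment traefik -n kube-system -o json` is available on the host; the helper names `image_tag` and `fetch_deployment` are hypothetical and are not part of js-sdk.

```python
import json
import subprocess
from typing import Optional


def image_tag(deployment: dict, container: str = "traefik") -> Optional[str]:
    """Return the image tag of `container` in a Deployment dict, or None if untagged."""
    for c in deployment["spec"]["template"]["spec"]["containers"]:
        if c["name"] != container:
            continue
        # Drop a digest suffix (image@sha256:...) before looking for a tag.
        image = c["image"].split("@", 1)[0]
        # Only the last path segment can carry a tag; this avoids mistaking
        # a registry port (registry:5000/traefik) for a tag.
        last_segment = image.rsplit("/", 1)[-1]
        if ":" in last_segment:
            return last_segment.rsplit(":", 1)[1]
        return None
    raise LookupError(f"container {container!r} not found in deployment")


def fetch_deployment(name: str = "traefik", namespace: str = "kube-system") -> dict:
    """Hypothetical helper: read the Deployment as JSON through kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "deployment", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Trimmed-down Deployment matching the describe output above.
sample = {
    "spec": {
        "template": {
            "spec": {"containers": [{"name": "traefik", "image": "traefik:2.4.8"}]}
        }
    }
}
print(image_tag(sample))
```

on a live cluster, `image_tag(fetch_deployment())` would give the tag that is actually running, independent of what helm lists as the app version.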

so now we have our desired version. however, every time the service runs and checks whether Traefik needs an update, it decides to uninstall and reinstall our desired version even though it is already installed, because it compares the desired version (2.4.8) against the app version helm reports (2.4.9).

[-] threebot: 2021-07-18 15:54:10.710 | INFO     | upgrade_traefik:job:16 - Upgrade Traefik Service:: Updating traefik from 2.4.9 to 2.4.8
...
[-] threebot: 2021-07-18 15:54:17.320 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml --namespace kube-system delete traefik
...
[-] threebot: 2021-07-18 15:54:19.624 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml --namespace kube-system install --timeout 7m0s --create-namespace traefik traefik/traefik  -f <(echo -e 'image:
[-] threebot:   tag: 2.4.8
[-] threebot: additionalArguments:
...

this could be fixed by reading the container version from kubectl. however, using the app version to check and upgrade Traefik has another drawback: it is not possible to target a specific chart version when it shares its app version with an earlier chart. a newer chart could contain a fix we need while still using the same app version. so when multiple chart releases share an app version, there is no straightforward way to tell, from the app version alone, which chart should be installed.

as a real example, Traefik charts 10.0.0, 10.0.1, 10.0.2, 10.1.0 and 10.1.1 all use the same Traefik app version, 2.4.9.
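
to make the ambiguity concrete, here is a small Python sketch that groups chart versions by app version. the entries are hard-coded from the example above; the `version` and `app_version` keys are assumed to match what `helm search repo traefik/traefik --versions -o json` returns, and `charts_by_app_version` is a hypothetical helper name.

```python
from collections import defaultdict


def charts_by_app_version(entries):
    """Map each app version to the list of chart versions that ship it."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["app_version"]].append(entry["version"])
    return dict(grouped)


# The five chart releases named above, all shipping Traefik 2.4.9.
entries = [
    {"version": chart, "app_version": "2.4.9"}
    for chart in ("10.0.0", "10.0.1", "10.0.2", "10.1.0", "10.1.1")
]

grouped = charts_by_app_version(entries)
# One app version resolves to five candidate charts, so the app version
# alone cannot identify which chart to install.
print(grouped["2.4.9"])
```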

an implementation suggestion:

we should update the implementation to track a chart version instead of the app version, and install the chart using helm's --version option.
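
a rough Python sketch of what that check could look like, not the actual fix. it assumes a release entry shaped like one item of `helm list -o json` (with a `chart` field such as `traefik-10.1.1`); the constant `DESIRED_CHART_VERSION` and the helper names are hypothetical.

```python
# Hypothetical constant: the chart version we want, replacing the app-version constant.
DESIRED_CHART_VERSION = "10.1.1"


def installed_chart_version(release: dict, chart_name: str = "traefik") -> str:
    """Extract the chart version from a release's 'chart' field, e.g. 'traefik-10.1.1'."""
    chart = release["chart"]
    prefix = chart_name + "-"
    if not chart.startswith(prefix):
        raise ValueError(f"unexpected chart field: {chart!r}")
    # Strip the known prefix instead of splitting on '-', so a pre-release
    # suffix containing '-' stays part of the version.
    return chart[len(prefix):]


def needs_reinstall(release: dict, desired: str = DESIRED_CHART_VERSION) -> bool:
    """Compare chart versions; the app version is deliberately ignored."""
    return installed_chart_version(release) != desired


def install_command(desired: str = DESIRED_CHART_VERSION, namespace: str = "kube-system") -> list:
    """Build the helm install arguments, pinning the chart with --version."""
    return [
        "helm", "--namespace", namespace,
        "install", "traefik", "traefik/traefik",
        "--version", desired,
    ]


# Release entry mirroring the helm list output shown earlier.
release = {"name": "traefik", "namespace": "kube-system",
           "chart": "traefik-10.1.1", "app_version": "2.4.9"}

if needs_reinstall(release):
    print(" ".join(install_command()))
else:
    print("chart version already matches, nothing to do")
```

because the comparison uses the chart field, the check stays stable across runs and no longer depends on the image tag override.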

Version information

Steps to reproduce

1 - import the UpgradeTraefik service and invoke the job method. it will always try to uninstall the current version and reinstall it.

root@zosv2-04:~# jsng
JS-NG> from jumpscale.packages.vdc_dashboard.services.upgrade_traefik import UpgradeTraefik                                                                                                                        
JS-NG> UpgradeTraefik().job()

2 - note the app version reported by the service (or the upgrade_traefik method). it will be incorrect.
3 - note the app version reported by helm list. it is incorrect, for the reason explained above.
4 - to check the app version that is actually deployed and running, describe the Traefik deployment.

Traceback/Logs/Alerts

[-] threebot: 2021-07-18 15:54:10.710 | INFO     | upgrade_traefik:job:16 - Upgrade Traefik Service:: Updating traefik from 2.4.9 to 2.4.8
[-] threebot: 2021-07-18 15:54:11.217 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml repo add traefik https://helm.traefik.io/traefik
[-] threebot: 2021-07-18 15:54:11.409 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: kubectl --kubeconfig /root/.kube/config get nodes -o json
[-] threebot: 2021-07-18 15:54:11.425 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: kubectl --kubeconfig /root/.kube/config get deployments -A -o json
[-] threebot: 2021-07-18 15:54:12.610 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml repo update
[-] threebot: 2021-07-18 15:54:13.142 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: kubectl --kubeconfig /root/.kube/config describe nodes
[-] threebot: 2021-07-18 15:54:13.153 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: kubectl --kubeconfig /root/.kube/config get statefulset -A -o json
[-] threebot: 2021-07-18 15:54:14.616 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml --namespace kube-system list -o json
[-] threebot: 2021-07-18 15:54:14.630 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml --namespace kube-system list -o json
[-] threebot: 2021-07-18 15:54:15.103 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: kubectl --kubeconfig /root/.kube/config get nodes -o wide -o json
[-] threebot: 2021-07-18 15:54:15.118 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm get values --kubeconfig /root/.kube/config --namespace=default traefik -o json
[-] threebot: 2021-07-18 15:54:16.001 | INFO     | jumpscale.sals.vdc.kubernetes_auto_extend:update_stats:99 - Kubernetes stats: {'k3os-25407': {'cpu': {'total': 2000, 'used': 1000}, 'memory': {'total': 3935.55859375, 'used': 640}, 'wid': 58641}, 'k3os-31149': {'cpu': {'total': 1000, 'used': 600}, 'memory': {'total': 1987.11328125, 'used': 582}, 'wid': 58640}}
[-] threebot: 2021-07-18 15:54:16.125 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml --namespace kube-system list -o json
[-] threebot: 2021-07-18 15:54:16.137 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml --namespace kube-system list -o json
[-] threebot: 2021-07-18 15:54:16.586 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: kubectl --kubeconfig /root/.kube/config get ingress -n kube-system -l app.kubernetes.io/instance=traefik -o=jsonpath='{.items[0].spec.rules[0].host}'
[-] threebot: 2021-07-18 15:54:17.317 | INFO     | domain:job:25 - restoring domain for deployment: {'Release': 'traefik', 'Version': None, 'Creation': '2021-07-18 15:49:26+00:00', 'Status': 'Running', 'Status Details': [{'lastTransitionTime': '2021-07-18T15:49:27Z', 'lastUpdateTime': '2021-07-18T15:49:27Z', 'message': 'Deployment has minimum availability.', 'reason': 'MinimumReplicasAvailable', 'status': 'True', 'type': 'Available'}, {'lastTransitionTime': '2021-07-18T15:49:27Z', 'lastUpdateTime': '2021-07-18T15:49:44Z', 'message': 'ReplicaSet "traefik-576846495f" has successfully progressed.', 'reason': 'NewReplicaSetAvailable', 'status': 'True', 'type': 'Progressing'}], 'User Supplied Values': {}, 'VDC Name': 'dev18072021', 'Domain': '', 'Chart': 'traefik', 'Namespace': 'kube-system'}
[-] threebot: 2021-07-18 15:54:17.320 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml --namespace kube-system delete traefik
[-] threebot: 2021-07-18 15:54:17.333 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml --namespace k3os-system delete traefik
[-] threebot: Traceback (most recent call last):
[-] threebot:   File "src/gevent/greenlet.py", line 906, in gevent._gevent_cgreenlet.Greenlet.run
[-] threebot:   File "/sandbox/code/github/threefoldtech/js-sdk/jumpscale/sals/vdc/kubernetes.py", line 560, in clean_traefik
[-] threebot:     manager.delete_deployed_release("traefik", ns)
[-] threebot:   File "/sandbox/code/github/threefoldtech/js-sdk/jumpscale/sals/kubernetes/manager.py", line 25, in wrapper
[-] threebot:     return method(self, *args, **kwargs)
[-] threebot:   File "/sandbox/code/github/threefoldtech/js-sdk/jumpscale/sals/kubernetes/manager.py", line 181, in delete_deployed_release
[-] threebot:     raise j.exceptions.Runtime(f"Failed to deploy chart {release} , error was {err}")
[-] threebot: jumpscale.core.exceptions.exceptions.Runtime: Failed to deploy chart traefik , error was WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml
[-] threebot: WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml
[-] threebot: Error: uninstall: Release not loaded: traefik: release: not found
[-] threebot: 
[-] threebot: 2021-07-18T15:54:19Z <Greenlet at 0x7fbd481bb370: clean_traefik(<jumpscale.sals.kubernetes.manager.Manager object , 'k3os-system')> failed with Runtime
[-] threebot: 
[-] threebot: 2021-07-18 15:54:19.624 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm --kubeconfig /root/sandbox/cfg/vdc/kube/samehabouelsaad/dev18072021.yaml --namespace kube-system install --timeout 7m0s --create-namespace traefik traefik/traefik  -f <(echo -e 'image:
[-] threebot:   tag: 2.4.8
[-] threebot: additionalArguments:
[-] threebot:   - "--certificatesresolvers.default.acme.tlschallenge"
[-] threebot:   - "--certificatesresolvers.default.acme.email=dsafsdajfksdhfkjadsfoo@you.com"
[-] threebot:   - "--certificatesresolvers.default.acme.storage=/data/acme.json"
[-] threebot:   - "--certificatesresolvers.default.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory"
[-] threebot:   - "--certificatesresolvers.default.acme.httpchallenge.entrypoint=web"
[-] threebot:   - "--certificatesresolvers.gridca.acme.tlschallenge"
[-] threebot:   - "--certificatesresolvers.gridca.acme.email=dsafsdajfksdhfkjadsfoo@you.com"
[-] threebot:   - "--certificatesresolvers.gridca.acme.storage=/data/acme1.json"
[-] threebot:   - "--certificatesresolvers.gridca.acme.caserver=https://ca1.grid.tf"
[-] threebot:   - "--certificatesresolvers.gridca.acme.httpchallenge.entrypoint=web"
[-] threebot:   - "--certificatesresolvers.le.acme.tlschallenge"
[-] threebot:   - "--certificatesresolvers.le.acme.email=dsafsdajfksdhfkjadsfoo@you.com"
[-] threebot:   - "--certificatesresolvers.le.acme.storage=/data/acme2.json"
[-] threebot:   - "--certificatesresolvers.le.acme.caserver=https://acme-v02.api.letsencrypt.org/directory"
[-] threebot:   - "--certificatesresolvers.le.acme.httpchallenge.entrypoint=web"
[-] threebot: ports:
[-] threebot:   web:
[-] threebot:     redirectTo: websecure
[-] threebot:   websecure:
[-] threebot:     tls:
[-] threebot:       enabled: true')
[-] threebot: 2021-07-18 15:54:24.523 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: kubectl --kubeconfig /root/.kube/config get nodes -o json
[-] threebot: 2021-07-18 15:54:25.136 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: kubectl --kubeconfig /root/.kube/config describe nodes
[-] threebot: 2021-07-18 15:54:25.994 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: kubectl --kubeconfig /root/.kube/config get nodes -o wide -o json
[-] threebot: 2021-07-18 15:54:26.385 | INFO     | jumpscale.sals.vdc.kubernetes_auto_extend:update_stats:99 - Kubernetes stats: {'k3os-25407': {'cpu': {'total': 2000, 'used': 1000}, 'memory': {'total': 3935.55859375, 'used': 640}, 'wid': 58641}, 'k3os-31149': {'cpu': {'total': 1000, 'used': 600}, 'memory': {'total': 1987.11328125, 'used': 582}, 'wid': 58640}}
sameh-farouk commented 3 years ago

verified on jsdk d727cf38b244c2bfd489: it is now working as expected.

JS-NG> from jumpscale.packages.vdc_dashboard.services.upgrade_traefik import UpgradeTraefik                                                                                                                        
JS-NG> UpgradeTraefik().job()                                                                                                                                                                                      
2021-09-08 15:11:04.730 | DEBUG    | jumpscale.sals.kubernetes.manager:_execute:45 - kubernetes manager: helm list -A -o json
2021-09-08 15:11:05.836 | INFO     | jumpscale.packages.vdc_dashboard.services.upgrade_traefik:job:20 - Upgrade Traefik Service:: Traefik using latest version 9.20.1
sameh-farouk commented 3 years ago

this was possibly the cause of #3270 too

sameh-farouk commented 3 years ago

closing, as it was fixed in #3274 and verified.