vmware-archive / helm-crd

Experimental CRD controller for managing Helm releases
Apache License 2.0

Error: "x" has no deployed releases #33

Open Globegitter opened 5 years ago

Globegitter commented 5 years ago

Recently we started seeing the following error messages in the tiller controller, e.g.:

2018/08/03 09:37:02 Downloading https://kubernetes-charts.storage.googleapis.com/nginx-ingress-0.18.0.tgz ...
2018/08/03 09:37:05 Updating release ingress-nginx-public
2018/08/03 09:37:05 Error updating ingress/nginx-public, will retry: rpc error: code = Unknown desc = "ingress-nginx-public" has no deployed releases
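
As far as I understand, tiller returns "has no deployed releases" when none of the stored revisions of a release is in the DEPLOYED state. To check that directly, something like the following should list the revisions and their statuses (assuming the helm CLI is pointed at the same tiller instance):

# Show the stored revisions of the release and their statuses;
# updates fail with this error when none of them is DEPLOYED.
helm history ingress-nginx-public --max 20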

We seem to be getting this for almost everything we have deployed using helm-crd. One service gets a different error:

2018/08/03 09:29:41 Downloading https://kubernetes-charts.storage.googleapis.com/redis-3.3.0.tgz ...
2018/08/03 09:29:47 Updating release prod-redis
2018/08/03 09:29:59 Error updating prod/redis, will retry: rpc error: code = Unknown desc = no Service with the name "api-proxy-redis-metrics" found

But strangely enough, that Service does exist.
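
The error does not say which namespace tiller looked in, so something along these lines confirms the Service really is there:

# Confirm the Service tiller claims is missing does exist somewhere in the cluster.
kubectl get services --all-namespaces | grep api-proxy-redis-metrics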

We have no idea at this stage why this is happening; it seemed to start from one day to the next without any major changes that we are aware of. We have been running version 0.4.1 of the controller and tiller 2.9.1 for a few weeks now, and until now everything had been working smoothly.

In tiller there are logs like:

[tiller] 2018/08/03 09:28:43 getting history for release ingress-nginx-public
[storage] 2018/08/03 09:28:43 getting release history for "ingress-nginx-public"
[tiller] 2018/08/03 09:28:46 preparing update for ingress-nginx-public

Looking at the tiller ConfigMaps, we do have ConfigMaps for these releases; the latest revision of the nginx one, for example, shows:

labels:
    CREATED_AT: "1530139352"
    NAME: ingress-nginx-public
    OWNER: TILLER
    STATUS: PENDING_UPGRADE
    VERSION: "153"

and the revision before that shows STATUS: SUPERSEDED.
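
For completeness, this is roughly how we are listing the status labels of all stored revisions at once (assuming the default ConfigMap storage backend in kube-system):

# Show every stored revision of the release together with its STATUS and VERSION labels.
kubectl -n kube-system get configmap -l OWNER=TILLER,NAME=ingress-nginx-public -L STATUS,VERSION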

We are still investigating, but is there any way to get more insight into what is going on and what we are getting back from tiller?

jasongwartz commented 5 years ago

We suspect this was caused by too many tracked release revisions and their ConfigMaps (over three thousand). We set the TILLER_HISTORY_MAX environment variable, but since tiller still couldn't find the last successfully deployed release, that alone didn't fix it.
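
For reference, setting the limit amounted to something like this (the limit of 10 is just an example, and this assumes the default tiller-deploy deployment in kube-system):

# Cap tiller's per-release history so the ConfigMaps stop piling up.
kubectl -n kube-system set env deployment/tiller-deploy TILLER_HISTORY_MAX=10

# Or equivalently, when (re)initialising tiller via the helm CLI:
helm init --upgrade --history-max 10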

The eventual resolution was to delete all of the previous ConfigMaps for each release and then re-run the upgrades through tiller with UpgradeForce (as per https://github.com/bitnami-labs/helm-crd/pull/34), which caused tiller to re-own or recreate the resources. The downside is that externally provisioned components, like our AWS ELB and EBS volumes, were destroyed and recreated, so we lost some historical metric data from Prometheus and had to update DNS records to point at the new load balancer.
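
In case it helps anyone else, the per-release cleanup was essentially a label-selector delete against tiller's ConfigMaps, along these lines (the release name is just an example; we repeated it for each release):

# Remove the old, superseded revision ConfigMaps that tiller stored for one release.
kubectl -n kube-system delete configmap -l OWNER=TILLER,NAME=ingress-nginx-public,STATUS=SUPERSEDED

Note that UpgradeForce corresponds to what helm upgrade --force does from the CLI: resources that cannot be updated in place are deleted and recreated, which is why the load balancer and volumes were replaced.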