keel-hq / keel

Kubernetes Operator to automate Helm, DaemonSet, StatefulSet & Deployment updates
https://keel.sh
Mozilla Public License 2.0

Failed to setup Tiller tunnel #406

Closed Voltash closed 5 years ago

Voltash commented 5 years ago

Hi, after I upgraded Keel, the keel pod is stuck in CrashLoopBackOff status.

Normal   Pulled     10m (x5 over 12m)    kubelet, gke-reservation-app-cluster-pool-1-87c0303f-6h80  Container image "keelhq/keel:0.14.3-rc1" already present on machine  
Normal   Created    10m (x5 over 12m)    kubelet, gke-reservation-app-cluster-pool-1-87c0303f-6h80  Created container  
Normal   Started    10m (x5 over 12m)    kubelet, gke-reservation-app-cluster-pool-1-87c0303f-6h80  Started container
Warning  BackOff    2m5s (x50 over 12m)  kubelet, gke-reservation-app-cluster-pool-1-87c0303f-6h80  Back-off restarting failed container

Logs from keel pod

time="2019-06-14T10:48:59Z" level=info msg="extension.credentialshelper: helper registered" name=aws
time="2019-06-14T10:48:59Z" level=info msg="bot: registered" name=slack
time="2019-06-14T10:48:59Z" level=info msg="keel starting..." arch=amd64 build_date=2019-06-14T091257Z go_version=go1.12 os=linux revision=86a28a0f version=
time="2019-06-14T10:48:59Z" level=info msg="initializing database" database_path=/data/keel.db type=sqlite3
time="2019-06-14T10:48:59Z" level=info msg="extension.notification.auditor: audit logger configured" name=auditor
time="2019-06-14T10:48:59Z" level=info msg="notificationSender: sender configured" sender name=auditor
time="2019-06-14T10:48:59Z" level=info msg="provider.kubernetes: using in-cluster configuration"
time="2019-06-14T10:48:59Z" level=fatal msg="failed to setup Tiller tunnel" error="forwarding ports: error upgrading connection: pods \"tiller-deploy-7b4c69bc6f-k6r7c\" is forbidden: User \"system:serviceaccount:kube-system:keel\" cannot create resource \"pods/portforward\" in API group \"\" in the namespace \"kube-system\""
rusenask commented 5 years ago

hmm, seems like a permission error. I've added pods/portforward to the "create" verbs in the chart; an updated one should be built in a bit.

rusenask commented 5 years ago

can you maybe try modifying clusterrole.yaml locally by adding pods/portforward? Like this:

    verbs:
      - get
      - delete # required to delete pods during force upgrade of the same tag
      - watch
      - list
      - update
  - apiGroups:
      - ""
    resources:
      - configmaps
      - pods/portforward
    verbs:
      - get
      - create
      - update
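
To double-check that the new rule is in effect, one option is to ask the API server whether the keel service account may create the port-forward subresource (service account name taken from the error message above; running the check this way requires permission to impersonate service accounts):

$ kubectl auth can-i create pods/portforward \
    --as=system:serviceaccount:kube-system:keel -n kube-system
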
aleroyer commented 5 years ago

Hi. Just ran into the same issue. Updating the chart solved it :)

Voltash commented 5 years ago

After updating, the issue is solved. Thanks!

ggrocco commented 5 years ago

I recreated everything using these steps:

$ helm repo add keel-charts https://charts.keel.sh
$ helm repo update
$ helm upgrade --install keel --namespace=kube-system keel-charts/keel 

The clusterrole.yaml now looks like this:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: "2019-06-14T14:50:24Z"
  name: keel
  resourceVersion: "60126726"
  selfLink: /apis/rbac.authorization.k8s.io/v1/clusterroles/keel
  uid: bd897ada-8eb3-bedb-11e9-0ad4bdebc406
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get
  - watch
  - list
- apiGroups:
  - ""
  - extensions
  - apps
  - batch
  resources:
  - pods
  - replicasets
  - replicationcontrollers
  - statefulsets
  - deployments
  - daemonsets
  - jobs
  - cronjobs
  verbs:
  - get
  - delete
  - watch
  - list
  - update
- apiGroups:
  - ""
  resources:
  - configmaps
  - pods/portforward
  verbs:
  - get
  - create
  - update

But it still crashes: time="2019-06-14T14:53:43Z" level=fatal msg="failed to setup Tiller tunnel" error="could not find tiller"

Any idea what is wrong?

rusenask commented 5 years ago

not sure, it seems it can create the tunnel to tiller (just like helm cli does) but then cannot find tiller? :/ does your tiller have a different name or something like that?

ggrocco commented 5 years ago

Fully standard: namespace tiller, service port 44134

rusenask commented 5 years ago

func GetTillerPodImage(client corev1.PodsGetter, namespace string) (string, error) {
    selector := tillerPodLabels.AsSelector()
    pod, err := getFirstRunningPod(client, namespace, selector)
    if err != nil {
        return "", err
    }
    for _, c := range pod.Spec.Containers {
        if c.Name == "tiller" {
            return c.Image, nil
        }
    }
    return "", fmt.Errorf("could not find a tiller pod")
}

so it's looking for a pod with a container named "tiller". Can you check the keel config via the k8s manifests as well, to see whether the namespace is really "kube-system"?
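
A quick way to see which namespace Tiller actually runs in is to list its pods across all namespaces (the app=helm,name=tiller labels are what a default helm init applies; adjust the selector if Tiller was installed differently):

$ kubectl get pods --all-namespaces -l app=helm,name=tiller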

ggrocco commented 5 years ago

Sorry @rusenask, but I don't know how to do this... can you help me?

ggrocco commented 5 years ago

Before this version, Keel was able to access Tiller via tillerAddress: "tiller-deploy.tiller.svc.cluster.local:44134"

ggrocco commented 5 years ago

The Tiller pod has the container name the code searches for:

spec:
  containers:
  - env:
    - name: TILLER_NAMESPACE
      value: tiller
    name: tiller
rusenask commented 5 years ago

could you try kubectl get pods --all-namespaces, find keel, and then do kubectl describe pod -n <keel namespace> <keel pod name>?

I think I might add support for an env var too, cause it would just solve your issue immediately :D

ggrocco commented 5 years ago

@rusenask thanks for the help, the namespace was the problem keeping the pod from coming up. Now I'm getting an error and I don't know whether it's related to Tiller too:

E0614 16:41:10.989468       1 portforward.go:385] error copying from local connection to remote stream: read tcp4 127.0.0.1:37949->127.0.0.1:46860: read: connection reset by peer
E0614 16:41:50.189454       1 portforward.go:372] error copying from remote stream to local connection: readfrom tcp4 127.0.0.1:37949->127.0.0.1:47090: write tcp4 127.0.0.1:37949->127.0.0.1:47090: write: broken pipe
rusenask commented 5 years ago

tagged a new release, you will be able to set tillerAddress again, that should fix it :)

Voltash commented 5 years ago

After the upgrade, keel deploys the same image multiple times. Before the upgrade, this configuration worked correctly.

keel:
  policy: force
  matchTag: true
  images:
    - repository: image.repository
      tag: image.tag

Logs from keel

time="2019-06-14T19:17:25Z" level=info msg="policy for release default/res-app parsed: force"
time="2019-06-14T19:17:25Z" level=info msg="provider.helm: ignoring" parsed_image_name="eu.gcr.io/oceanwide/reservation-php-app-dev:latest" policy=force target_image_name
=eu.gcr.io/oceanwide/reservation-php-app-dev
time="2019-06-14T19:17:26Z" level=info msg="policy for release kube-system/keel parsed: all"
time="2019-06-14T19:17:26Z" level=info msg="policy for release kube-system/keel parsed: all"
time="2019-06-14T19:17:26Z" level=info msg="policy for release default/res-app parsed: force"
time="2019-06-14T19:17:26Z" level=info msg="policy for release default/res-app parsed: force"
time="2019-06-14T19:20:11Z" level=info msg="provider.helm: release updated" release=res-app version=94
time="2019-06-14T19:20:12Z" level=warning msg="provider.helm: got error while resetting approvals counter after successful update" error="approval not found: record not f
ound" name=res-app namespace=default
time="2019-06-14T19:22:30Z" level=info msg="provider.helm: release updated" release=res-app version=95
time="2019-06-14T19:22:30Z" level=warning msg="provider.helm: got error while resetting approvals counter after successful update" error="approval not found: record not f
ound" name=res-app namespace=default
paulmorabito commented 5 years ago

I ran:

$ helm repo update
$ helm upgrade --install keel --namespace=kube-system keel-charts/keel 

The keel pod is now running, however I'm getting the same issue as @ggrocco.

E0629 09:14:08.891362       1 portforward.go:372] error copying from remote stream to local connection: readfrom tcp4 127.0.0.1:46429->127.0.0.1:41736: write tcp4 127.0.0.1:46429->127.0.0.1:41736: write: broken pipe
E0629 09:28:45.289849       1 portforward.go:385] error copying from local connection to remote stream: read tcp4 127.0.0.1:46429->127.0.0.1:38796: read: connection reset by peer

Is there any known fix for this?

rusenask commented 5 years ago

Yes, you can set the Tiller service address directly here: https://github.com/keel-hq/keel/blob/master/chart/keel/values.yaml#L25
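
One way to apply it without editing the chart is to pass the value on upgrade; this assumes the setting is exposed as a top-level tillerAddress key, as in the linked values.yaml:

$ helm upgrade --install keel --namespace=kube-system keel-charts/keel \
    --set tillerAddress=tiller-deploy.tiller.svc.cluster.local:44134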

paulmorabito commented 5 years ago

@rusenask thanks. That partially worked, but I'm getting a context deadline exceeded error now.

time="2019-06-30T01:40:38Z" level=info msg="keel starting..." arch=amd64 build_date=2019-06-14T170754Z go_version=go1.12 os=linux revision=743309d1 version=0.15.0-rc1
time="2019-06-30T01:40:38Z" level=info msg="initializing database" database_path=/data/keel.db type=sqlite3
time="2019-06-30T01:40:38Z" level=info msg="extension.notification.auditor: audit logger configured" name=auditor
time="2019-06-30T01:40:38Z" level=info msg="notificationSender: sender configured" sender name=auditor
time="2019-06-30T01:40:38Z" level=info msg="provider.kubernetes: using in-cluster configuration"
time="2019-06-30T01:40:38Z" level=info msg="Tiller address specified: tiller-deploy.tiller.svc.cluster.local:44134"
time="2019-06-30T01:40:38Z" level=info msg="provider.helm: tiller address 'tiller-deploy.tiller.svc.cluster.local:44134' supplied"
time="2019-06-30T01:40:38Z" level=info msg="provider.defaultProviders: provider 'kubernetes' registered"
time="2019-06-30T01:40:38Z" level=info msg="provider.defaultProviders: provider 'helm' registered"
time="2019-06-30T01:40:38Z" level=info msg="extension.credentialshelper: helper registered" name=secrets
time="2019-06-30T01:40:38Z" level=info msg="bot.slack.Configure(): Slack approval bot is not configured"
time="2019-06-30T01:40:38Z" level=error msg="bot.Run(): can not get configuration for bot [slack]"
time="2019-06-30T01:40:38Z" level=info msg=started context=watch resource=deployments
time="2019-06-30T01:40:38Z" level=info msg=started context=watch resource=statefulsets
time="2019-06-30T01:40:38Z" level=info msg=started context=buffer
time="2019-06-30T01:40:38Z" level=info msg=started context=watch resource=daemonsets
time="2019-06-30T01:40:38Z" level=info msg=started context=watch resource=cronjobs
time="2019-06-30T01:40:38Z" level=info msg="authentication is not enabled, admin HTTP handlers are not initialized"
time="2019-06-30T01:40:38Z" level=info msg="webhook trigger server starting..." port=9300
time="2019-06-30T01:40:38Z" level=info msg="trigger.poll.manager: polling trigger configured"
time="2019-06-30T01:40:43Z" level=error msg="provider.defaultProviders: failed to get tracked images" error="context deadline exceeded" provider=helm
time="2019-06-30T01:40:51Z" level=error msg="provider.defaultProviders: failed to get tracked images" error="context deadline exceeded" provider=helm

and it repeats on and on from there. Nothing seems to be updating, even the non-helm deployments I have.

EDIT: Changing it as below fixed it:

tillerAddress: tiller-deploy:44134
severity1 commented 5 years ago

not sure why tillerAddress: tiller-deploy:44134 doesn't work for me, but tillerAddress: tiller-deploy.kube-system:44134 worked, most likely because my keel is hosted in a different namespace
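
When in doubt, listing the Tiller service shows which namespace and port to use; the short tiller-deploy:44134 form only resolves when keel runs in the same namespace as Tiller, otherwise the namespace-qualified tiller-deploy.<namespace>:44134 form is needed:

$ kubectl get svc --all-namespaces | grep tiller-deploy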

rchenzheng commented 4 years ago

This is still an issue with the latest helm release

time="2020-01-24T15:40:50Z" level=error msg="provider.defaultProviders: failed to get tracked images" error="context deadline exceeded" provider=helm
time="2020-01-24T15:40:58Z" level=error msg="provider.defaultProviders: failed to get tracked images" error="context deadline exceeded" provider=helm
time="2020-01-24T15:41:03Z" level=error msg="provider.defaultProviders: failed to get tracked images" error="context deadline exceeded" provider=helm
time="2020-01-24T15:41:08Z" level=error msg="provider.defaultProviders: failed to get tracked images" error="context deadline exceeded" provider=helm
time="2020-01-24T15:43:13Z" level=error msg="provider.defaultProviders: failed to get tracked images" error="context deadline exceeded" provider=helm
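
The repeating "context deadline exceeded" generally means keel cannot reach Tiller at the configured address before the request times out. A rough sanity check, assuming Tiller lives in kube-system under the default service name, is to port-forward to the service and see whether a tunnel opens at all:

$ kubectl port-forward -n kube-system svc/tiller-deploy 44134:44134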