rancher / rio

Application Deployment Engine for Kubernetes
https://rio.io
Apache License 2.0

All linkerd containers CrashLoopBackOff #1028

Closed: citananda closed this issue 4 years ago

citananda commented 4 years ago

Describe the bug
All containers in the linkerd namespace are in status CrashLoopBackOff.

To Reproduce
I don't know exactly when it happens, but everything else on my cluster is fine.

Expected behavior
Status Running.

Kubernetes version & type (GKE, on-prem):
kubectl version

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T21:03:42Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T20:55:23Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

Type: Rio version:
rio info

Rio Version: v0.7.0 (4afd4901)
Rio CLI Version: v0.7.0 (4afd4901)
Cluster Domain: XXX.on-rio.io
Cluster Domain IPs: XXX
System Namespace: rio-system
Wildcard certificates: XXX.on-rio.io(true)

Additional context
rio system logs output:

rio-controller | time="2020-04-16T08:11:59Z" level=info msg="Starting rio-controller, version: v0.7.0, git commit: 4afd4901"
rio-controller | time="2020-04-16T08:12:03Z" level=info msg="Updating CRD services.rio.cattle.io"
rio-controller | time="2020-04-16T08:12:03Z" level=info msg="Updating CRD stacks.rio.cattle.io"
rio-controller | I0416 08:12:07.084643       1 leaderelection.go:241] attempting to acquire leader lease  rio-system/rio...
rio-controller | time="2020-04-16T08:12:07Z" level=info msg="listening at :443"
rio-controller | I0416 08:12:07.196553       1 leaderelection.go:251] successfully acquired lease rio-system/rio
rio-controller | time="2020-04-16T08:12:07Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting apps/v1, Kind=Deployment controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting tekton.dev/v1alpha1, Kind=TaskRun controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting rio.cattle.io/v1, Kind=Service controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting rio.cattle.io/v1, Kind=Router controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting rio.cattle.io/v1, Kind=Stack controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting rio.cattle.io/v1, Kind=ExternalService controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting gloo.solo.io/v1, Kind=Settings controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting /v1, Kind=Service controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting /v1, Kind=Secret controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting /v1, Kind=Endpoints controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting gitwatcher.cattle.io/v1, Kind=GitCommit controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting cert-manager.io/v1alpha2, Kind=Certificate controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting apps/v1, Kind=StatefulSet controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting admin.rio.cattle.io/v1, Kind=PublicDomain controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting extensions/v1beta1, Kind=Ingress controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting apps/v1, Kind=DaemonSet controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting admin.rio.cattle.io/v1, Kind=ClusterDomain controller"
rio-controller | time="2020-04-16T08:12:10Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
rio-controller | E0416 08:12:40.317858       1 controller.go:135] error syncing 'rio-system/gateway-proxy': handler rdns-service: GetDomain: failed to execute a request: Get https://api.on-rio.io/v1/domain/om8l28.on-rio.io: dial tcp: i/o timeout, handler smi: skip processing, requeuing
rio-controller | E0416 08:13:18.755214       1 controller.go:135] error syncing 'gy-prod-cicd/admin-cicd-v0r4pq7': handler service-build: failed to update gy-prod-cicd/admin-cicd-v0r4pq7-e57e0-235bb tekton.dev/v1alpha1, Kind=TaskRun for service-build gy-prod-cicd/admin-cicd-v0r4pq7: Timeout: request did not complete within requested timeout 34s, handler service-build: failed to update gy-prod-cicd/admin-cicd-v0r4pq7-e57e0-235bb tekton.dev/v1alpha1, Kind=TaskRun for service-build gy-prod-cicd/admin-cicd-v0r4pq7: Timeout: request did not complete within requested timeout 34s, handler template: skip processing, requeuing
rio-controller | E0416 08:13:20.262749       1 controller.go:135] error syncing 'gy-prod-cicd/saas-cicd-v0h767j': handler service-build: failed to update gy-prod-cicd/saas-cicd-v0h767j-066d3-d2de1 tekton.dev/v1alpha1, Kind=TaskRun for service-build gy-prod-cicd/saas-cicd-v0h767j: Timeout: request did not complete within requested timeout 34s, handler service-build: failed to update gy-prod-cicd/saas-cicd-v0h767j-066d3-d2de1 tekton.dev/v1alpha1, Kind=TaskRun for service-build gy-prod-cicd/saas-cicd-v0h767j: Timeout: request did not complete within requested timeout 34s, handler template: skip processing, requeuing
rio-controller | E0416 08:13:20.366242       1 controller.go:135] error syncing 'gy-prod-cicd/manager-cicd-v0wmb24': handler service-build: failed to update gy-prod-cicd/manager-cicd-v0wmb24-37a11-cb1d6 tekton.dev/v1alpha1, Kind=TaskRun for service-build gy-prod-cicd/manager-cicd-v0wmb24: Timeout: request did not complete within requested timeout 34s, handler service-build: failed to update gy-prod-cicd/manager-cicd-v0wmb24-37a11-cb1d6 tekton.dev/v1alpha1, Kind=TaskRun for service-build gy-prod-cicd/manager-cicd-v0wmb24: Timeout: request did not complete within requested timeout 34s, handler template: skip processing, requeuing
rio-controller | E0416 08:13:20.615236       1 controller.go:135] error syncing 'gy-dev-cicd/saas-cicd-v0x299c': handler service-build: failed to update gy-dev-cicd/saas-cicd-v0x299c-066d3-7e1f7 tekton.dev/v1alpha1, Kind=TaskRun for service-build gy-dev-cicd/saas-cicd-v0x299c: Timeout: request did not complete within requested timeout 34s, handler service-build: failed to update gy-dev-cicd/saas-cicd-v0x299c-066d3-7e1f7 tekton.dev/v1alpha1, Kind=TaskRun for service-build gy-dev-cicd/saas-cicd-v0x299c: Timeout: request did not complete within requested timeout 34s, handler template: skip processing, requeuing
rio-controller | E0416 08:13:20.780065       1 controller.go:135] error syncing 'gy-dev-cicd/api-fpm-cicd-v04kfbb': handler service-build: failed to update gy-dev-cicd/api-fpm-cicd-v04kfbb-62004-70029 tekton.dev/v1alpha1, Kind=TaskRun for service-build gy-dev-cicd/api-fpm-cicd-v04kfbb: Timeout: request did not complete within requested timeout 34s, handler service-build: failed to update gy-dev-cicd/api-fpm-cicd-v04kfbb-62004-70029 tekton.dev/v1alpha1, Kind=TaskRun for service-build gy-dev-cicd/api-fpm-cicd-v04kfbb: Timeout: request did not complete within requested timeout 34s, handler template: skip processing, requeuing

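The GetDomain error above ("dial tcp: i/o timeout" against api.on-rio.io) suggests the controller cannot reach the rDNS service at all. A quick way to test DNS and outbound connectivity from inside the cluster is a throwaway pod (a sketch; the busybox:1.28 image and the pod name dnstest are arbitrary choices, not part of Rio):

# Resolve the rDNS endpoint from inside the cluster
kubectl run dnstest --rm -it --restart=Never --image=busybox:1.28 -- nslookup api.on-rio.io
# Sanity-check that in-cluster DNS works at all
kubectl run dnstest --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default
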
kubectl get pod -n linkerd

NAME                                      READY   STATUS             RESTARTS   AGE
linkerd-controller-7c4d687d54-stjkw       0/3     CrashLoopBackOff   13         3m55s
linkerd-destination-58b689b6b4-lcr7s      0/2     CrashLoopBackOff   10         3m52s
linkerd-grafana-6cc5cf6756-8dntf          0/2     CrashLoopBackOff   6          3m46s
linkerd-identity-75b47bfdb5-tw4wl         0/2     CrashLoopBackOff   8          3m38s
linkerd-prometheus-5849bdd67d-8b2xs       0/2     CrashLoopBackOff   7          3m35s
linkerd-proxy-injector-59ffdc9bcc-8sqt8   0/2     CrashLoopBackOff   8          3m31s
linkerd-sp-validator-5d87695744-nr4tr     0/2     CrashLoopBackOff   8          3m26s
linkerd-tap-766df6dcd8-jgrf4              0/2     CrashLoopBackOff   7          3m18s
linkerd-web-546f4b86cb-q48nk              0/2     CrashLoopBackOff   7          3m14s
StrongMonkey commented 4 years ago

Can you check linkerd logs? It is possible that you don't have sufficient privileges to launch those containers.
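For reference, the same information can be pulled directly with kubectl, which also surfaces PSP/permission denials in the pod events (a sketch; substitute a real pod name from the listing above for the placeholder):

# Logs from the last crashed run of one container
kubectl -n linkerd logs linkerd-controller-7c4d687d54-stjkw -c linkerd-proxy --previous
# The Events section at the bottom of describe usually shows why a container cannot start
kubectl -n linkerd describe pod linkerd-controller-7c4d687d54-stjkw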

citananda commented 4 years ago

I moved the pod security policy of the project from unrestricted to none. Here are the logs for each container it starts, from rio --namespace linkerd logs -a:

+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › destination
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › linkerd-proxy
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › public-api
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › linkerd-init
Error opening stream to linkerd/linkerd-controller-78f7b7ff7c-qwhpk: linkerd-proxy
: container "linkerd-proxy" in pod "linkerd-controller-78f7b7ff7c-qwhpk" is waiting to start: PodInitializing
Error opening stream to linkerd/linkerd-controller-78f7b7ff7c-qwhpk: linkerd-init
: container "linkerd-init" in pod "linkerd-controller-78f7b7ff7c-qwhpk" is waiting to start: PodInitializing
Error opening stream to linkerd/linkerd-controller-78f7b7ff7c-qwhpk: public-api
: container "public-api" in pod "linkerd-controller-78f7b7ff7c-qwhpk" is waiting to start: PodInitializing
Error opening stream to linkerd/linkerd-controller-78f7b7ff7c-qwhpk: destination
: container "destination" in pod "linkerd-controller-78f7b7ff7c-qwhpk" is waiting to start: PodInitializing
- linkerd linkerd-controller-78f7b7ff7c-qwhpk
- linkerd linkerd-controller-78f7b7ff7c-qwhpk
- linkerd linkerd-controller-78f7b7ff7c-qwhpk
- linkerd linkerd-controller-78f7b7ff7c-qwhpk
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › destination
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › linkerd-proxy
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › public-api
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › linkerd-init
Error opening stream to linkerd/linkerd-controller-78f7b7ff7c-qwhpk: public-api
: container "public-api" in pod "linkerd-controller-78f7b7ff7c-qwhpk" is waiting to start: PodInitializing
Error opening stream to linkerd/linkerd-controller-78f7b7ff7c-qwhpk: linkerd-init
: container "linkerd-init" in pod "linkerd-controller-78f7b7ff7c-qwhpk" is waiting to start: PodInitializing
Error opening stream to linkerd/linkerd-controller-78f7b7ff7c-qwhpk: destination
: container "destination" in pod "linkerd-controller-78f7b7ff7c-qwhpk" is waiting to start: PodInitializing
Error opening stream to linkerd/linkerd-controller-78f7b7ff7c-qwhpk: linkerd-proxy
: container "linkerd-proxy" in pod "linkerd-controller-78f7b7ff7c-qwhpk" is waiting to start: PodInitializing
- linkerd linkerd-controller-78f7b7ff7c-qwhpk
- linkerd linkerd-controller-78f7b7ff7c-qwhpk
- linkerd linkerd-controller-78f7b7ff7c-qwhpk
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › destination
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › linkerd-proxy
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › public-api
Error opening stream to linkerd/linkerd-controller-78f7b7ff7c-qwhpk: public-api
: container "public-api" in pod "linkerd-controller-78f7b7ff7c-qwhpk" is waiting to start: PodInitializing
Error opening stream to linkerd/linkerd-controller-78f7b7ff7c-qwhpk: destination
: container "destination" in pod "linkerd-controller-78f7b7ff7c-qwhpk" is waiting to start: PodInitializing
Error opening stream to linkerd/linkerd-controller-78f7b7ff7c-qwhpk: linkerd-proxy
: container "linkerd-proxy" in pod "linkerd-controller-78f7b7ff7c-qwhpk" is waiting to start: PodInitializing
- linkerd linkerd-controller-78f7b7ff7c-qwhpk
- linkerd linkerd-controller-78f7b7ff7c-qwhpk
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › destination
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › public-api
linkerd-controller-78f7b7ff7c-qwhpk public-api time="2020-04-17T07:36:27Z" level=info msg="running version stable-2.6.1"
linkerd-controller-78f7b7ff7c-qwhpk public-api time="2020-04-17T07:36:28Z" level=info msg="Using cluster domain: cluster.local"
linkerd-controller-78f7b7ff7c-qwhpk public-api time="2020-04-17T07:36:28Z" level=info msg="waiting for caches to sync"
linkerd-controller-78f7b7ff7c-qwhpk public-api time="2020-04-17T07:36:28Z" level=info msg="caches synced"
linkerd-controller-78f7b7ff7c-qwhpk public-api time="2020-04-17T07:36:28Z" level=info msg="starting admin server on :9995"
linkerd-controller-78f7b7ff7c-qwhpk public-api time="2020-04-17T07:36:28Z" level=info msg="starting HTTP server on :8085"
linkerd-controller-78f7b7ff7c-qwhpk destination time="2020-04-17T07:36:30Z" level=info msg="running version stable-2.6.1"
linkerd-controller-78f7b7ff7c-qwhpk destination time="2020-04-17T07:36:30Z" level=info msg="waiting for caches to sync"
linkerd-controller-78f7b7ff7c-qwhpk destination time="2020-04-17T07:36:30Z" level=info msg="caches synced"
linkerd-controller-78f7b7ff7c-qwhpk destination time="2020-04-17T07:36:30Z" level=info msg="starting admin server on :9996"
linkerd-controller-78f7b7ff7c-qwhpk destination time="2020-04-17T07:36:30Z" level=info msg="starting gRPC server on :8086"
- linkerd linkerd-controller-78f7b7ff7c-qwhpk
+ linkerd linkerd-controller-78f7b7ff7c-qwhpk › linkerd-proxy
linkerd-controller-78f7b7ff7c-qwhpk linkerd-proxy time="2020-04-17T07:36:35Z" level=info msg="running version stable-2.6.1"
linkerd-controller-78f7b7ff7c-qwhpk linkerd-proxy time="2020-04-17T07:36:35Z" level=info msg="Using with pre-existing key: /var/run/linkerd/identity/end-entity/key.p8"
linkerd-controller-78f7b7ff7c-qwhpk linkerd-proxy time="2020-04-17T07:36:35Z" level=info msg="Using with pre-existing CSR: /var/run/linkerd/identity/end-entity/key.p8"
linkerd-controller-78f7b7ff7c-qwhpk linkerd-proxy Invalid configuration: invalid environment variable
linkerd-controller-78f7b7ff7c-qwhpk public-api time="2020-04-17T07:37:05Z" level=info msg="shutting down HTTP server on :8085"
linkerd-controller-78f7b7ff7c-qwhpk destination time="2020-04-17T07:37:08Z" level=info msg="shutting down gRPC server on :8086"

I don't know if this message is related to the bug: linkerd-proxy Invalid configuration: invalid environment variable. Each time, the HTTP server on :8085 and the gRPC server on :8086 shut down shortly after starting.
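If the proxy really is rejecting one of its environment variables, dumping the container's env from the pod spec can show which value looks malformed (a sketch using kubectl's jsonpath filter; the pod name is the one from the logs above):

kubectl -n linkerd get pod linkerd-controller-78f7b7ff7c-qwhpk \
  -o jsonpath='{.spec.containers[?(@.name=="linkerd-proxy")].env}'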

citananda commented 4 years ago

I found something that may be interesting. Running kubectl get apiservice, everything is fine except this line:

v1alpha1.tap.linkerd.io   linkerd/linkerd-tap   False (MissingEndpoints)   7d16h

Going deeper with kubectl get apiservice v1alpha1.tap.linkerd.io -o yaml:

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apiregistration.k8s.io/v1","kind":"APIService","metadata":{"annotations":{},"labels":{"linkerd.io/control-plane-component":"tap","linkerd.io/control-plane-ns":"linkerd"},"name":"v1alpha1.tap.linkerd.io"},"spec":{"caBundle":"XXX","group":"tap.linkerd.io","groupPriorityMinimum":1000,"service":{"name":"linkerd-tap","namespace":"linkerd"},"version":"v1alpha1","versionPriority":100}}
  creationTimestamp: "2020-04-09T17:32:06Z"
  labels:
    linkerd.io/control-plane-component: tap
    linkerd.io/control-plane-ns: linkerd
  name: v1alpha1.tap.linkerd.io
  resourceVersion: "10124425"
  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1alpha1.tap.linkerd.io
  uid: 9e9c1478-87fc-4787-b587-861261e4bf66
spec:
  caBundle: XXX
  group: tap.linkerd.io
  groupPriorityMinimum: 1000
  service:
    name: linkerd-tap
    namespace: linkerd
    port: 443
  version: v1alpha1
  versionPriority: 100
status:
  conditions:
  - lastTransitionTime: "2020-04-16T08:27:33Z"
    message: endpoints for service/linkerd-tap in "linkerd" have no addresses with
      port name "apiserver"
    reason: MissingEndpoints
    status: "False"
    type: Available
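
The MissingEndpoints condition means the linkerd-tap Service has no ready address behind the port named "apiserver", which is expected while the tap pod itself is crash-looping. This can be verified directly (a sketch):

# Should list a ready address for the port named "apiserver"
kubectl -n linkerd get endpoints linkerd-tap -o yaml
# Cross-check the port name declared on the Service
kubectl -n linkerd get service linkerd-tap -o yaml
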
citananda commented 4 years ago

Finally I uninstalled and reinstalled Rio, and now things are better but not fully working. kubectl get apiservice:

NAME                                    SERVICE                      AVAILABLE   AGE
v1.                                     Local                        True        25d
v1.admin.rio.cattle.io                  Local                        True        69m
v1.admissionregistration.k8s.io         Local                        True        25d
v1.apiextensions.k8s.io                 Local                        True        25d
v1.apps                                 Local                        True        25d
v1.authentication.k8s.io                Local                        True        25d
v1.authorization.k8s.io                 Local                        True        25d
v1.autoscaling                          Local                        True        25d
v1.batch                                Local                        True        25d
v1.coordination.k8s.io                  Local                        True        25d
v1.crd.projectcalico.org                Local                        True        3h35m
v1.enterprise.gloo.solo.io              Local                        True        68m
v1.gateway.solo.io                      Local                        True        68m
v1.gitwatcher.cattle.io                 Local                        True        69m
v1.gloo.solo.io                         Local                        True        68m
v1.monitoring.coreos.com                Local                        True        13d
v1.networking.k8s.io                    Local                        True        25d
v1.rbac.authorization.k8s.io            Local                        True        25d
v1.rio.cattle.io                        Local                        True        69m
v1.scheduling.k8s.io                    Local                        True        25d
v1.storage.k8s.io                       Local                        True        25d
v1alpha1.authentication.istio.io        Local                        True        13d
v1alpha1.caching.internal.knative.dev   Local                        True        67m
v1alpha1.linkerd.io                     Local                        True        68m
v1alpha1.rbac.istio.io                  Local                        True        3h35m
v1alpha1.split.smi-spec.io              Local                        True        68m
v1alpha1.tap.linkerd.io                 linkerd/linkerd-tap          True        68m
v1alpha1.tekton.dev                     Local                        True        67m
v1alpha2.acme.cert-manager.io           Local                        True        68m
v1alpha2.cert-manager.io                Local                        True        68m
v1alpha2.config.istio.io                Local                        True        30h
v1alpha2.linkerd.io                     Local                        True        68m
v1alpha3.networking.istio.io            Local                        True        14d
v1beta1.admissionregistration.k8s.io    Local                        True        25d
v1beta1.apiextensions.k8s.io            Local                        True        25d
v1beta1.authentication.k8s.io           Local                        True        25d
v1beta1.authorization.k8s.io            Local                        True        25d
v1beta1.batch                           Local                        True        25d
v1beta1.certificates.k8s.io             Local                        True        25d
v1beta1.coordination.k8s.io             Local                        True        25d
v1beta1.discovery.k8s.io                Local                        True        25d
v1beta1.events.k8s.io                   Local                        True        25d
v1beta1.extensions                      Local                        True        25d
v1beta1.metrics.k8s.io                  kube-system/metrics-server   True        25d
v1beta1.networking.k8s.io               Local                        True        25d
v1beta1.node.k8s.io                     Local                        True        25d
v1beta1.policy                          Local                        True        25d
v1beta1.rbac.authorization.k8s.io       Local                        True        25d
v1beta1.scheduling.k8s.io               Local                        True        25d
v1beta1.security.istio.io               Local                        True        30h
v1beta1.storage.k8s.io                  Local                        True        25d
v2beta1.autoscaling                     Local                        True        25d
v2beta2.autoscaling                     Local                        True        25d
v3.cluster.cattle.io                    Local                        True        13d
v3.management.cattle.io                 Local                        True        30h

Now, when I run:

rio run -name my-app-cicd --build-clone-secret gitcredential-ssh --build-branch master --build-dockerfile Dockerfile.prod --build-image-name my-project/my-app:2.10 ssh://user@my.gitlab:my-project/my-app.git

Here is the result:

+ my-project-cicd my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 › step-build-and-push
+ my-project-cicd my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 › step-git-source-source-fs7mq
my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 step-git-source-source-fs7mq {"level":"warn","ts":1587133986.679937,"logger":"fallback-logger","caller":"logging/config.go:69","msg":"Fetch GitHub commit ID from kodata failed: \"ref: refs/heads/master\" is not a valid GitHub commit ID"}
my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 step-git-source-source-fs7mq {"level":"info","ts":1587133987.196259,"logger":"fallback-logger","caller":"git/git.go:103","msg":"Successfully cloned ssh://user@my.gitlab:my-project/my-app.git @ 2183508b9dfc47f71b9ccde82256b44b7b6f3bb9 in path /workspace/source"}
my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 step-build-and-push error: failed to get status: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial tcp 10.43.144.230:8080: i/o timeout"

Before this problem, this command was working.
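
The "dial tcp 10.43.144.230:8080: i/o timeout" points at a ClusterIP the build step cannot reach. To find out which Service owns that IP and whether it has ready endpoints behind it (a sketch; in Rio the build machinery normally lives in rio-system, but the lookup works regardless):

# Which Service has this ClusterIP?
kubectl get svc --all-namespaces | grep 10.43.144.230
# Then check that the owning Service has ready endpoints, e.g.:
kubectl -n rio-system get endpoints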

citananda commented 4 years ago

I am closing this issue because I uninstalled and reinstalled Rio.
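
For reference, the uninstall/reinstall cycle with the rio CLI looks roughly like this (a sketch; it assumes the v0.7 CLI exposes install and uninstall subcommands, so check rio --help first):

# Remove the Rio components from the cluster (assumed subcommand)
rio uninstall
# Reinstall into the default rio-system namespace
rio install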