rancher / rio

Application Deployment Engine for Kubernetes
https://rio.io
Apache License 2.0

Rio Continuous Deployment not working #1029

Closed. citananda closed this issue 4 years ago.

citananda commented 4 years ago

Describe the bug: The image is not building and I get the following message:

+ my-project-cicd my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 › step-build-and-push
+ my-project-cicd my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 › step-git-source-source-fs7mq
my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 step-git-source-source-fs7mq {"level":"warn","ts":1587133986.679937,"logger":"fallback-logger","caller":"logging/config.go:69","msg":"Fetch GitHub commit ID from kodata failed: \"ref: refs/heads/master\" is not a valid GitHub commit ID"}
my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 step-git-source-source-fs7mq {"level":"info","ts":1587133987.196259,"logger":"fallback-logger","caller":"git/git.go:103","msg":"Successfully cloned ssh://user@my.gitlab:my-project/my-app.git @ 2183508b9dfc47f71b9ccde82256b44b7b6f3bb9 in path /workspace/source"}
my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 step-build-and-push error: failed to get status: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial tcp 10.43.144.230:8080: i/o timeout"

To Reproduce: I am not able to answer that. My cluster was working fine, then I observed that all the linkerd pods had stopped working (details here: https://github.com/rancher/rio/issues/1028), so I decided to uninstall and reinstall rio. The reinstall went fine and all containers are now OK.

Expected behavior: The image should be built and pushed to the registry.

Kubernetes version & type (GKE, on-prem): kubectl version

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T21:03:42Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T20:55:23Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

Type: Rio version: rio info

Rio Version: v0.7.0 (4afd4901)
Rio CLI Version: v0.7.0 (4afd4901)
Cluster Domain: xxx.on-rio.io
Cluster Domain IPs: 37.187.30.218
System Namespace: rio-system
Wildcard certificates: xxx.on-rio.io(true)

Additional context rio system logs output:

rio-controller | time="2020-04-17T16:31:24Z" level=info msg="Starting rio-controller, version: v0.7.0, git commit: 4afd4901"
rio-controller | time="2020-04-17T16:31:24Z" level=info msg="Updating CRD services.rio.cattle.io"
rio-controller | time="2020-04-17T16:31:24Z" level=info msg="Updating CRD stacks.rio.cattle.io"
rio-controller | I0417 16:31:26.231290       1 leaderelection.go:241] attempting to acquire leader lease  rio-system/rio...
rio-controller | time="2020-04-17T16:31:26Z" level=info msg="listening at :443"
rio-controller | I0417 16:31:26.320231       1 leaderelection.go:251] successfully acquired lease rio-system/rio
rio-controller | time="2020-04-17T16:31:30Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting apps/v1, Kind=Deployment controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting extensions/v1beta1, Kind=Ingress controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting admin.rio.cattle.io/v1, Kind=ClusterDomain controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting cert-manager.io/v1alpha2, Kind=Certificate controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting rio.cattle.io/v1, Kind=Router controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting admin.rio.cattle.io/v1, Kind=PublicDomain controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting /v1, Kind=Endpoints controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting rio.cattle.io/v1, Kind=Stack controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting rio.cattle.io/v1, Kind=ExternalService controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting /v1, Kind=Secret controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting rio.cattle.io/v1, Kind=Service controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting apps/v1, Kind=DaemonSet controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting /v1, Kind=Service controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting apps/v1, Kind=StatefulSet controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting gitwatcher.cattle.io/v1, Kind=GitCommit controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting gloo.solo.io/v1, Kind=Settings controller"
rio-controller | time="2020-04-17T16:31:40Z" level=info msg="Starting tekton.dev/v1alpha1, Kind=TaskRun controller"
StrongMonkey commented 4 years ago

"Fetch GitHub commit ID from kodata failed: \"ref: refs/heads/master\" is not a valid GitHub commit ID". This sounds like you don't have a master branch in your repo?
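If it helps, one quick way to confirm the branch exists on the remote (just a sketch; the URL is copied from the clone log above, adjust as needed):

git ls-remote ssh://user@my.gitlab:my-project/my-app.git refs/heads/master
# prints "<sha>  refs/heads/master" if the branch exists, nothing otherwise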

citananda commented 4 years ago

Yes, the master branch exists; this project was building successfully before I uninstalled and reinstalled rio with the same command. I don't know why it tries to fetch from github.com first, because the project is on a self-hosted GitLab. The next line shows that the clone works: Successfully cloned ssh://user@my.gitlab:my-project/my-app.git

citananda commented 4 years ago

Maybe this can help: I found this log on linkerd-proxy-injector

time="2020-04-17T16:46:29Z" level=info msg="received admission review request 1b8cd68c-7f93-4cd5-8140-2637b96bacf6"
time="2020-04-17T16:46:29Z" level=info msg="received pod/my-project-cicd-v0nf4l4-62004-1ab0c-pod-2828db"
time="2020-04-17T16:46:29Z" level=warning msg="couldn't retrieve parent object gy-prod-cicd-taskrun-my-project-cicd-v0nf4l4-62004-1ab0c; error: rpc error: code = Unimplemented desc = unimplemented resource type: taskrun"
time="2020-04-17T16:46:29Z" level=info msg="skipped pod/my-project-cicd-v0nf4l4-62004-1ab0c-pod-2828db: neither the namespace nor the pod have the annotation \"linkerd.io/inject:enabled\""
time="2020-04-17T16:48:15Z" level=info msg="received admission review request 0389f534-e224-408b-a64e-e3a0919ca841"
time="2020-04-17T16:48:15Z" level=info msg="received pod/my-project-cicd-v0c7rsx-62004-1ab0c-pod-ab1ae9"
time="2020-04-17T16:48:15Z" level=warning msg="couldn't retrieve parent object gy-prod-cicd-taskrun-my-project-cicd-v0c7rsx-62004-1ab0c; error: rpc error: code = Unimplemented desc = unimplemented resource type: taskrun"
time="2020-04-17T16:48:15Z" level=info msg="skipped pod/my-project-cicd-v0c7rsx-62004-1ab0c-pod-ab1ae9: neither the namespace nor the pod have the annotation \"linkerd.io/inject:enabled\""
StrongMonkey commented 4 years ago

Oh, never mind, it looks like there are some connectivity issues. Can you find which service this IP (10.43.144.230:8080) belongs to?
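For example (a sketch; with k3s-style defaults 10.42.0.0/16 is the pod CIDR and 10.43.0.0/16 the service CIDR, so this address is probably a Service ClusterIP rather than a pod IP):

kubectl get service --all-namespaces -o wide | grep 10.43.144.230
# or scan the CLUSTER-IP column manually
kubectl get service --all-namespaces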

citananda commented 4 years ago

With kubectl --namespace <namespace> get pod -o wide I looked at every container in all namespaces of the cluster and none of them has this IP (10.43.144.230); all the pod IPs are 10.42.x.x. I see that linkerd-identity is listening on port 8080. I restarted it, but on the next try I got the error transport: error while dialing: dial tcp 10.43.98.4:8080: i/o timeout, so the IP changed but is still in 10.43.x.x and still no pod has it. The gateway-proxy pod also listens on port 8080, but that is not its IP either. What can I do to solve this issue? Please help.

citananda commented 4 years ago

buildkit is also listening on port 8080, and it should be the one used to build the docker image, but even after restarting the pod (and all the linkerd and rio-system pods) it still doesn't work. Here is the log of linkerd-install:

+ [[ -n '' ]]
+ [[ -n '' ]]
+ linkerd install
+ kubectl apply -f -
namespace/linkerd created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-identity created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-identity created
serviceaccount/linkerd-identity created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-controller created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-controller created
serviceaccount/linkerd-controller created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-destination created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-destination created
serviceaccount/linkerd-destination created
role.rbac.authorization.k8s.io/linkerd-heartbeat created
rolebinding.rbac.authorization.k8s.io/linkerd-heartbeat created
serviceaccount/linkerd-heartbeat created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-web-admin created
serviceaccount/linkerd-web created
customresourcedefinition.apiextensions.k8s.io/serviceprofiles.linkerd.io created
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/trafficsplits.split.smi-spec.io configured
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-prometheus created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-prometheus created
serviceaccount/linkerd-prometheus created
serviceaccount/linkerd-grafana created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-proxy-injector created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-proxy-injector created
serviceaccount/linkerd-proxy-injector created
secret/linkerd-proxy-injector-tls created
mutatingwebhookconfiguration.admissionregistration.k8s.io/linkerd-proxy-injector-webhook-config created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-sp-validator created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-sp-validator created
serviceaccount/linkerd-sp-validator created
secret/linkerd-sp-validator-tls created
validatingwebhookconfiguration.admissionregistration.k8s.io/linkerd-sp-validator-webhook-config created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-tap created
clusterrole.rbac.authorization.k8s.io/linkerd-linkerd-tap-admin created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-tap created
clusterrolebinding.rbac.authorization.k8s.io/linkerd-linkerd-tap-auth-delegator created
serviceaccount/linkerd-tap created
rolebinding.rbac.authorization.k8s.io/linkerd-linkerd-tap-auth-reader created
secret/linkerd-tap-tls created
apiservice.apiregistration.k8s.io/v1alpha1.tap.linkerd.io created
podsecuritypolicy.policy/linkerd-linkerd-control-plane created
role.rbac.authorization.k8s.io/linkerd-psp created
rolebinding.rbac.authorization.k8s.io/linkerd-psp created
configmap/linkerd-config created
secret/linkerd-identity-issuer created
service/linkerd-identity created
deployment.apps/linkerd-identity created
service/linkerd-controller-api created
service/linkerd-destination created
deployment.apps/linkerd-controller created
service/linkerd-dst created
deployment.apps/linkerd-destination created
cronjob.batch/linkerd-heartbeat created
service/linkerd-web created
deployment.apps/linkerd-web created
configmap/linkerd-prometheus-config created
service/linkerd-prometheus created
deployment.apps/linkerd-prometheus created
configmap/linkerd-grafana-config created
service/linkerd-grafana created
deployment.apps/linkerd-grafana created
deployment.apps/linkerd-proxy-injector created
service/linkerd-proxy-injector created
service/linkerd-sp-validator created
deployment.apps/linkerd-sp-validator created
service/linkerd-tap created
deployment.apps/linkerd-tap created
+ linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
is running version 2.6.1 but the latest stable version is 2.7.1
see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 2.6.1 but the latest stable version is 2.7.1
see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
Status check results are √
+ [[ 0 -ne 0 ]]

We can see that the linkerd CLI and the control plane are running version 2.6.1 while the latest stable is 2.7.1.

citananda commented 4 years ago

I found something that may help: on the socat pod, I ran ps aux and here is the result:

PID   USER     TIME   COMMAND
    1 root       0:00 socat TCP-LISTEN:5442,fork,bind=127.0.0.1,reuseaddr TCP:10.43.98.4:80
    6 root       0:00 /bin/sh
   12 root       0:00 ps aux

The IP 10.43.98.4 is there.

citananda commented 4 years ago

On the cluster, I finally found where the IP comes from by running kubectl get service --all-namespaces:

NAMESPACE    NAME           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)           AGE
rio-system   buildkitd      ClusterIP   10.43.98.4     <none>        8080/TCP,80/TCP   160m
rio-system   buildkitd-v0   ClusterIP   10.43.64.113   <none>        8080/TCP,80/TCP   160m

From the node, I can telnet to port 8080 of 10.43.98.4 with telnet 10.43.98.4 8080:

Trying 10.43.98.4...
Connected to 10.43.98.4.
Escape character is '^]'.

So why doesn't buildkit reply to rio?
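To narrow this down, two more checks could help (a sketch, not verified here): whether the buildkitd service actually has ready endpoints, and whether the port is reachable from inside a pod rather than from the node, since the timeout comes from the build pod itself.

kubectl -n rio-system get endpoints buildkitd
# run a throwaway pod and try the service port from inside the cluster
kubectl run nettest --rm -it --restart=Never --image=busybox -- telnet 10.43.98.4 8080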

citananda commented 4 years ago

The root cause of the problem is clear now; I am closing this issue and creating a new, more concise one.