rancher / rio

Application Deployment Engine for Kubernetes
https://rio.io
Apache License 2.0

Rio Buildkit not working #1030

Closed citananda closed 4 years ago

citananda commented 4 years ago

Describe the bug: The image is not building, and I get the following messages:

+ my-project-cicd my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 › step-build-and-push
+ my-project-cicd my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 › step-git-source-source-fs7mq
my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 step-git-source-source-fs7mq {"level":"warn","ts":1587133986.679937,"logger":"fallback-logger","caller":"logging/config.go:69","msg":"Fetch GitHub commit ID from kodata failed: \"ref: refs/heads/master\" is not a valid GitHub commit ID"}
my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 step-git-source-source-fs7mq {"level":"info","ts":1587133987.196259,"logger":"fallback-logger","caller":"git/git.go:103","msg":"Successfully cloned ssh://user@my.gitlab:my-project/my-app.git @ 2183508b9dfc47f71b9ccde82256b44b7b6f3bb9 in path /workspace/source"}
my-app-cicd-v0794bm-62004-4c11d-pod-fd25a2 step-build-and-push error: failed to get status: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial tcp 10.43.144.230:8080: i/o timeout"

The IP 10.43.144.230:8080 is Buildkit, and from the node I can reach it with telnet: telnet 10.43.98.4 8080

Trying 10.43.98.4...
Connected to 10.43.98.4.
Escape character is '^]'.
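The telnet test above only shows that the port is reachable from the node. A minimal sketch of the same TCP check in Python, which could be run from inside a pod to compare results, is shown here. The function name `can_connect` and the 3-second default timeout are illustrative choices, not part of Rio or Buildkit.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (including timeouts and refused connections) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("10.43.98.4", 8080)` repeats the telnet check. A successful result only proves that the TCP handshake works from wherever the script runs; it says nothing about the gRPC layer on top.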

To Reproduce: I am not able to answer that question. My cluster was working fine, then I observed that all linkerd pods had stopped working (details here: https://github.com/rancher/rio/issues/1028 and https://github.com/rancher/rio/issues/1029), so I decided to uninstall and reinstall rio. The reinstall worked fine and now all containers are OK.

Expected behavior: The image should be built and pushed to the registry. It was working before I uninstalled and reinstalled.

Kubernetes version & type (GKE, on-prem): kubectl version

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T21:03:42Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.4", GitCommit:"8d8aa39598534325ad77120c120a22b3a990b5ea", GitTreeState:"clean", BuildDate:"2020-03-12T20:55:23Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

Type: Rio version: rio info

Rio Version: v0.7.0 (4afd4901)
Rio CLI Version: v0.7.0 (4afd4901)
Cluster Domain: xxx.on-rio.io
Cluster Domain IPs: 37.187.30.218
System Namespace: rio-system
Wildcard certificates: xxx.on-rio.io(true)

Additional context kubectl get service --all-namespaces output:

NAMESPACE           NAME                                       TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                                        AGE
rio-system          buildkitd                                  ClusterIP      10.43.98.4      <none>        8080/TCP,80/TCP                                160m
rio-system          buildkitd-v0                               ClusterIP      10.43.64.113    <none>        8080/TCP,80/TCP                                160m
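
Note that the address in the error (10.43.144.230) differs from the buildkitd ClusterIP listed above (10.43.98.4). A small sketch for looking up which service, if any, owns a given ClusterIP in `kubectl get service --all-namespaces` output follows. The helper name `find_services_by_cluster_ip` is hypothetical, and the sample contains only the two rows shown in this report, so an empty result here does not prove the address is unused in the cluster.

```python
from typing import List, Tuple


def find_services_by_cluster_ip(table: str, cluster_ip: str) -> List[Tuple[str, str]]:
    """Return (namespace, name) pairs whose CLUSTER-IP column equals cluster_ip."""
    matches = []
    for line in table.strip().splitlines():
        fields = line.split()
        # Columns: NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue  # skip the header row and malformed lines
        if fields[3] == cluster_ip:
            matches.append((fields[0], fields[1]))
    return matches


# The two rows quoted in this report.
SAMPLE = """
NAMESPACE           NAME                                       TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                                        AGE
rio-system          buildkitd                                  ClusterIP      10.43.98.4      <none>        8080/TCP,80/TCP                                160m
rio-system          buildkitd-v0                               ClusterIP      10.43.64.113    <none>        8080/TCP,80/TCP                                160m
"""

print(find_services_by_cluster_ip(SAMPLE, "10.43.98.4"))
print(find_services_by_cluster_ip(SAMPLE, "10.43.144.230"))
```

Feeding the full command output into the function, rather than this excerpt, would show whether the dialed address belongs to any current service.
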
citananda commented 4 years ago

So after some investigations, here are my conclusions.

  1. The continuous integration is stuck at step-build-and-push because the pod can't connect to the buildkit pod (transport: error while dialing: dial tcp 10.43.144.230:8080: i/o timeout).
  2. From one node of the k8s cluster, I can telnet to buildkit on port 8080, so there is no problem on the buildkit side: telnet 10.43.189.55 8080
    Trying 10.43.189.55...
    Connected to 10.43.189.55.
    Escape character is '^]'.
  3. This is a connectivity problem between the taskrun and buildkit.
    3.1. It can be a connectivity problem on the taskrun side, because of this log:
    linkerd-proxy-injector-888fbb9f6-ln885 proxy-injector time="2020-04-18T07:33:55Z" level=warning msg="couldn't retrieve parent object my-project-cicd-taskrun-my-app-cicd-v0qt9fd-62004-1ab0c; error: rpc error: code = Unimplemented desc = unimplemented resource type: taskrun"
    linkerd-proxy-injector-888fbb9f6-ln885 proxy-injector time="2020-04-18T07:33:55Z" level=info msg="skipped pod/my-app-cicd-v0qt9fd-62004-1ab0c-pod-53fd4c: neither the namespace nor the pod have the annotation \"linkerd.io/inject:enabled\""

    I tried to add the annotation "linkerd.io/inject:enabled" on the namespace my-project-cicd without success; I got the same result.
    3.2. It can be a connectivity problem on the buildkit side: I tried to telnet 10.43.189.55 8080 from a pod in another namespace, but it fails.

At this point I am lost, because I don't know exactly what the purpose of each part of the Rio pods and the Rancher system is. Can someone guide me, please?

citananda commented 4 years ago

After adding the annotation linkerd.io/inject: enabled to the namespace my-project-cicd, I got a different result:

+ my-project-cicd my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d › linkerd-proxy
+ my-project-cicd my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d › step-build-and-push
+ my-project-cicd my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d › step-git-source-source-9hvjn
my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d linkerd-proxy time="2020-04-18T08:14:23Z" level=info msg="running version stable-2.6.1"
my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d linkerd-proxy INFO [     0.002185s] linkerd2_proxy Admin interface on 0.0.0.0:4191
my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d linkerd-proxy INFO [     0.002206s] linkerd2_proxy Inbound interface on 0.0.0.0:4143
my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d linkerd-proxy INFO [     0.002211s] linkerd2_proxy Outbound interface on 127.0.0.1:4140
my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d linkerd-proxy INFO [     0.002214s] linkerd2_proxy Tap interface on 0.0.0.0:4190
my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d linkerd-proxy INFO [     0.002217s] linkerd2_proxy Local identity is my-app-cicd-v0mhrdm-62004-1ab0c.my-project-cicd.serviceaccount.identity.linkerd.cluster.local
my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d linkerd-proxy INFO [     0.002221s] linkerd2_proxy Identity verified via linkerd-identity.linkerd.svc.cluster.local:8080 (linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local)
my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d linkerd-proxy INFO [     0.002225s] linkerd2_proxy Destinations resolved via linkerd-dst.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d linkerd-proxy ERR! [     3.003656s] linkerd2_proxy_identity::certify Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d linkerd-proxy ERR! [    16.005634s] linkerd2_proxy_identity::certify Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
my-app-cicd-v0mhrdm-62004-1ab0c-pod-83cc1d linkerd-proxy ERR! [    29.007782s] linkerd2_proxy_identity::certify Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
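
The three ERR! lines above carry uptime stamps, so the spacing between failed certification attempts can be read off directly. A rough sketch for extracting those gaps from linkerd-proxy log text is given here. The regular expression assumes the `ERR! [ <seconds>s]` layout seen in this log, and `error_gaps` is an illustrative name.

```python
import re
from typing import List

# Matches the uptime stamp on error lines, e.g. "ERR! [     3.003656s]".
ERR_TS = re.compile(r"ERR!\s+\[\s*([0-9]+\.[0-9]+)s\]")


def error_gaps(log_text: str) -> List[float]:
    """Return the seconds elapsed between consecutive ERR! lines."""
    times = [float(m.group(1)) for m in ERR_TS.finditer(log_text)]
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Applied to the log above, the two gaps are both roughly 13 seconds, which suggests the proxy keeps retrying at a steady interval and never reaches the identity service.
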
citananda commented 4 years ago

So I uninstalled, checked that no resources were left behind, and reinstalled with the latest Rio version (v0.7.1-rc1). Everything worked fine: no problems during the install, and all pods are running. BUT the command rio run -p 80:8080 https://github.com/rancher/rio-demo returns

tender-raman             https://tender-raman-v0-default.5g85ga.on-rio.io:31737   80:8080   1         100%      6 minutes ago   tender-raman-v0xjrg2: not ready; ImageReady: "step-build-and-push" exited with code 1 (image: "docker-pullable://moby/buildkit@sha256:0cf100454dd25079ce79b7417add2ae7ba55c1d4dfa512fd26e7259eac696732"); for logs run: kubectl -n default logs tender-raman-v0xjrg2-ee709-4e40c-pod-61998b -c step-build-and-push(Failed); tender-raman-v0xjrg2 build failed: "step-build-and-push" exited with code 1 (image: "docker-pullable://moby/buildkit@sha256:0cf100454dd25079ce79b7417add2ae7ba55c1d4dfa512fd26e7259eac696732"); for logs run: kubectl -n default logs tender-raman-v0xjrg2-ee709-4e40c-pod-61998b -c step-build-and-push

So I am stuck on step-build-and-push. The command kubectl -n default logs tender-raman-v0xjrg2-ee709-4e40c-pod-61998b -c step-build-and-push shows this error:

error: failed to get status: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial tcp 10.43.199.245:8080: i/o timeout"

kubectl get services --all-namespaces | grep 10.43.199.245 returns:

telnet 10.43.199.245 8080 returns

Trying 10.43.199.245...
Connected to 10.43.199.245.
Escape character is '^]'.

Please help me understand what is going on.

citananda commented 4 years ago

In the end, my solution was to develop my own CI/CD scripts while waiting for a stable release. Anyway, thanks for your help.