vmware-archive / buildkit-cli-for-kubectl

BuildKit CLI for kubectl is a tool for building container images with your Kubernetes cluster

CLI fails to communicate with a buildkit farm #97

Closed mmisztal1980 closed 2 years ago

mmisztal1980 commented 2 years ago

What steps did you take and what happened:

We are running a 6-pod buildkit farm. Our CI pipeline contains a step to build and push container images using buildkit:

IMAGE="$(AzureContainerRegistry)/$(Build.Repository.Name):$(Build.BuildNumber)"
CACHE_IMAGE="$(AzureContainerRegistry)/$(Build.Repository.Name):latest"
kubectl build \
  -t ${IMAGE,,} \
  -f $(System.DefaultWorkingDirectory)/Dockerfile \
  --push \
  --registry-secret buildkit \
  --cache-from ${CACHE_IMAGE,,} \
  --cache-to ${CACHE_IMAGE,,} \
  ./

Today, our developers started reporting multiple occurrences of the buildkit CLI failing to communicate with the buildkit deployment:

time="2021-07-20T09:57:39Z" level=error msg="Internal error occurred: error executing command in container: failed to exec in container: failed to create exec \"907b23b04ed4b841f6d2b22f324e882d1d34d1cd563313d7b78a26f262537b2e\": cannot exec in a stopped state: unknown"
Error: failed to get status: rpc error: code = Unavailable desc = timed out waiting for server handshake
Error: failed to get status: rpc error: code = Unavailable desc = timed out waiting for server handshake
/bin/bash --noprofile --norc /workspace/_temp/0af98d77-12d3-4113-bd02-73741cab1d13.sh
time="2021-07-20T11:11:32Z" level=error msg="unable to upgrade connection: container not found (\"buildkitd\")"

What did you expect to happen:

We expected our CI step with kubectl build to succeed.

Environment Details:

Builder Logs [If applicable, an excerpt from kubectl logs -l app=buildkit from around the time you hit the failure may be very helpful]

k -n azure-devops logs buildkit-6886b9567d-27xwm
time="2021-07-20T09:52:51Z" level=warning msg="using host network as the default"
time="2021-07-20T09:52:51Z" level=info msg="found worker \"gfy4o188qj65qnfd9pnl9itd4\", labels=map[org.mobyproject.buildkit.worker.executor:containerd org.mobyproject.buildkit.worker.hostname:buildkit-6886b9567d-27xwm org.mobyproject.buildkit.worker.snapshotter:overlayfs], platforms=[linux/amd64 linux/386]"
time="2021-07-20T09:52:51Z" level=info msg="found 1 workers, default=\"gfy4o188qj65qnfd9pnl9itd4\""
time="2021-07-20T09:52:51Z" level=warning msg="currently, only the default worker can be used."
time="2021-07-20T09:52:51Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
❯ k -n azure-devops logs buildkit-6886b9567d-4ncq7
time="2021-07-20T09:53:05Z" level=warning msg="using host network as the default"
time="2021-07-20T09:53:05Z" level=info msg="found worker \"v6pe5i7kgcbpgo405hi5981ka\", labels=map[org.mobyproject.buildkit.worker.executor:containerd org.mobyproject.buildkit.worker.hostname:buildkit-6886b9567d-4ncq7 org.mobyproject.buildkit.worker.snapshotter:overlayfs], platforms=[linux/amd64 linux/386]"
time="2021-07-20T09:53:05Z" level=info msg="found 1 workers, default=\"v6pe5i7kgcbpgo405hi5981ka\""
time="2021-07-20T09:53:05Z" level=warning msg="currently, only the default worker can be used."
time="2021-07-20T09:53:05Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
❯ k -n azure-devops logs buildkit-6886b9567d-6vpwm
time="2021-07-20T10:00:50Z" level=warning msg="using host network as the default"
time="2021-07-20T10:00:51Z" level=info msg="found worker \"morzl9t9zvra42giirbcdm1qe\", labels=map[org.mobyproject.buildkit.worker.executor:containerd org.mobyproject.buildkit.worker.hostname:buildkit-6886b9567d-6vpwm org.mobyproject.buildkit.worker.snapshotter:overlayfs], platforms=[linux/amd64 linux/386]"
time="2021-07-20T10:00:51Z" level=info msg="found 1 workers, default=\"morzl9t9zvra42giirbcdm1qe\""
time="2021-07-20T10:00:51Z" level=warning msg="currently, only the default worker can be used."
time="2021-07-20T10:00:51Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
❯ k -n azure-devops logs buildkit-6886b9567d-gs7gw
time="2021-07-20T09:52:41Z" level=warning msg="using host network as the default"
time="2021-07-20T09:52:41Z" level=info msg="found worker \"vqvtkpjug0wg3l87bbzbbriui\", labels=map[org.mobyproject.buildkit.worker.executor:containerd org.mobyproject.buildkit.worker.hostname:buildkit-6886b9567d-gs7gw org.mobyproject.buildkit.worker.snapshotter:overlayfs], platforms=[linux/amd64 linux/386]"
time="2021-07-20T09:52:41Z" level=info msg="found 1 workers, default=\"vqvtkpjug0wg3l87bbzbbriui\""
time="2021-07-20T09:52:41Z" level=warning msg="currently, only the default worker can be used."
time="2021-07-20T09:52:41Z" level=info msg="running server on /run/buildkit/buildkitd.sock"
time="2021-07-20T09:54:31Z" level=warning msg="reference for unknown type: application/vnd.buildkit.cacheconfig.v0"
time="2021-07-20T10:08:13Z" level=warning msg="reference for unknown type: application/vnd.buildkit.cacheconfig.v0"
❯ k -n azure-devops logs buildkit-6886b9567d-jk7vm
time="2021-07-20T09:52:36Z" level=warning msg="using host network as the default"
time="2021-07-20T09:52:36Z" level=info msg="found worker \"pp6xxuov01zfsxgpx6kfx2vs2\", labels=map[org.mobyproject.buildkit.worker.executor:containerd org.mobyproject.buildkit.worker.hostname:buildkit-6886b9567d-jk7vm org.mobyproject.buildkit.worker.snapshotter:overlayfs], platforms=[linux/amd64 linux/386]"
time="2021-07-20T09:52:36Z" level=info msg="found 1 workers, default=\"pp6xxuov01zfsxgpx6kfx2vs2\""
time="2021-07-20T09:52:36Z" level=warning msg="currently, only the default worker can be used."
time="2021-07-20T09:52:36Z" level=info msg="running server on /run/buildkit/buildkitd.sock"

dhiltgen commented 2 years ago

The logs don't show any obvious crash messages from the builder, but the error "unable to upgrade connection: container not found (\"buildkitd\")" seems to imply the container is no longer running.

Let's see if we can gather a little more information from your system to try to understand what's going wrong.

Can you gather the Deployment and all the Pod details in -o yaml form and paste them into this issue? Hopefully there will be something interesting in the pod events, or other status fields to shed some light on why it stopped working.
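For anyone following along, the request above could be scripted roughly as follows. This is a minimal sketch, not the maintainers' procedure: it assumes the Deployment is named `buildkit` (inferred from the pod names in the logs), lives in the `azure-devops` namespace, and carries the `app=buildkit` label mentioned in the issue template. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences and are not part of kubectl.

```shell
#!/usr/bin/env bash
# Sketch: gather Deployment, Pod, and event details for the builder.
# NAMESPACE, DEPLOYMENT, and LABEL are assumptions inferred from the thread.
NAMESPACE="${NAMESPACE:-azure-devops}"
DEPLOYMENT="${DEPLOYMENT:-buildkit}"
LABEL="${LABEL:-app=buildkit}"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

gather_builder_details() {
  # Full Deployment object, including status conditions.
  run kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" -o yaml
  # Every builder pod, including container statuses and node names.
  run kubectl -n "$NAMESPACE" get pods -l "$LABEL" -o yaml
  # Recent events in the namespace, oldest first.
  run kubectl -n "$NAMESPACE" get events --sort-by=.lastTimestamp
}
```

Calling `gather_builder_details > buildkit-details.txt` would capture everything in one file to paste into the issue; running it with `DRY_RUN=1` exported only prints the commands so they can be reviewed first.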

mmisztal1980 commented 2 years ago

Hi,

We've determined that the issues originated from a faulty node with a broken containerd runtime, which caused the pods (including buildkit) to get stuck in the Terminating state.
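A rough way to spot this situation is to list pods that have a deletion timestamp set (kubectl shows these as Terminating) next to the node they run on. The sketch below is an illustration and not the commands used in this thread: the `azure-devops` namespace is taken from the logs above, and `filter_terminating` and `list_terminating_pods` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: show pods that are being deleted, with the node each one is on.
NAMESPACE="${NAMESPACE:-azure-devops}"   # assumption, taken from the logs above

# Hypothetical helper: keep the header row plus rows whose third column is set.
# kubectl prints "<none>" in custom-columns output when a field is absent.
filter_terminating() {
  awk 'NR == 1 || $3 != "<none>"'
}

list_terminating_pods() {
  kubectl -n "$NAMESPACE" get pods \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,DELETING_SINCE:.metadata.deletionTimestamp' \
    | filter_terminating
}
```

If every row returned by `list_terminating_pods` points at the same node, that node is a reasonable first suspect, and `kubectl describe node` on it may show more detail.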