vmware-archive / bitfusion-with-kubernetes-integration

Bitfusion with Kubernetes Integration Support

InitContainer Stuck at Pending #54

Closed jcmickey33 closed 2 years ago

jcmickey33 commented 2 years ago

Describe the bug

First of all, I'll be up front that I'm not entirely certain "bug report" is the best place for this, but we'll start here. If there's a better way to categorize it, I'm happy to do so. Secondly, I'm working in two proprietary environments that are air-gapped (both from each other and from the internet), so I won't be able to provide screenshots or anything, but I'll be as descriptive as I can and will answer as many questions as possible if there are any.

The issue: when I deploy Bitfusion using the provided YAML files (updating them only to reflect my environments), the device plugins and webhooks deploy without issue. However, when I spin up a pod to actually USE Bitfusion, one environment starts the client as the initContainer, then continues on and works fine. The other gets stuck at "Pending" and the initContainer never starts.

Even if this isn't a true bug, I've checked everything I can think of and would love some suggestions about things I could look into.

Reproduction steps

1. make deploy
2. kubectl apply -f deployment.yaml
3. The container gets stuck in the Pending state

Expected behavior

Expect workload container to deploy and reach "Running" state.

Additional context

I'm deploying Bitfusion within a Kubernetes cluster. I pared down the Makefile, removing some of the variables and essentially skipping the "update" piece of the "deploy" section.

I have the most recent versions (the 0.4 tags) of the bitfusion-webhook, bitfusion-device-plugin, and bitfusion-client containers in my environment. One environment ends up with a running workload container that's able to contact the remote Bitfusion server. The other never gets to the "Init:0/1" state and hangs at "Pending."

I have checked this myself, and I sat down with the engineer who built the two environments; we confirmed that everything was set up so that the two air-gapped environments would be identical to each other.

If I deploy a container using the client image with no Bitfusion annotations or resource requests (and thus no InitContainer), it deploys successfully. If I then exec in and manually copy the libraries to /opt/bitfusion (line 15 of bitfusion-webhook-injector-configmap.yaml), and copy the ca.crt, servers.conf, and client.yaml where they need to be, I'm able to run a "list_gpus" command and get expected results.

The fact that I can manually carry out the steps the initContainer is supposed to perform, and that doing so works, suggests to me that the problem is with the initContainer. Yet in an identical state, it works on my other environment. Googling a container stuck in "Pending" also turns up the initContainer not deploying as a likely cause.
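
For anyone hitting the same symptom, here is a rough sketch of how one might tell "pod was never scheduled" apart from "initContainer is stuck", since both can sit under a Pending phase. It only reads standard pod status fields (`status.phase`, the `PodScheduled` condition, `status.initContainerStatuses`). The function name `explain_pending` and the sample pod, including the resource name `example.com/gpu`, are made up for illustration and are not output from either of my environments.

```python
def explain_pending(pod: dict) -> str:
    """Give a one-line guess at why a pod is Pending, from its status object."""
    status = pod.get("status", {})
    phase = status.get("phase", "Unknown")
    if phase != "Pending":
        return f"phase is {phase}, not Pending"

    # If the scheduler could not place the pod, the PodScheduled condition
    # is False and carries a reason and a human-readable message.
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            reason = cond.get("reason", "no reason")
            message = cond.get("message", "")
            return f"not scheduled ({reason}): {message}"

    # Otherwise the pod reached a node; look for an init container that is waiting.
    for ctr in status.get("initContainerStatuses", []):
        waiting = ctr.get("state", {}).get("waiting")
        if waiting is not None:
            return f"init container {ctr.get('name')} waiting: {waiting.get('reason', 'no reason')}"

    return "scheduled, no waiting init container reported"


# Hypothetical sample: a pod the scheduler could not place on any node.
sample = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/3 nodes are available: 3 Insufficient example.com/gpu.",
            }
        ],
    }
}

print(explain_pending(sample))
```

To run it against a real pod, the dict could come from parsing the output of `kubectl get pod <name> -o json` with `json.loads`. If the result says "not scheduled", the initContainer was never given a chance to start, so the cause is more likely on the node or resource side than in the init image itself.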

jcmickey33 commented 2 years ago

Hardware issue found, fixed, everything works now.