ferris-cx commented 2 months ago

I created a Pod that requests a GPU, entered the container with kubectl exec, and ran nvidia-smi. I can see the device information for all 4 GPUs, but I should only see the one GPU that has been assigned to this Pod. GPU visibility is not being limited. The likely reason is that the environment variable inside the container is NVIDIA_VISIBLE_DEVICES=ALL, whereas it should be set to the UUID of the assigned GPU (NVIDIA_VISIBLE_DEVICES=<GPU UUID>). How do I solve this problem?
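One way to check the diagnosis described here is to read the variable from inside the container and compare it with the GPUs that nvidia-smi lists. The snippet below is only a minimal sketch: the helper name classify_visible_devices and the pod name my-gpu-pod are hypothetical, and the helper simply labels a value as exposing every GPU, none, or an explicit device list.

```shell
#!/bin/sh
# Hypothetical helper: interpret an NVIDIA_VISIBLE_DEVICES value.
# Both "all" and "ALL" are accepted here as "every GPU on the node is exposed".
# "none" and "void" are documented NVIDIA container runtime values that
# expose no GPU; any other non-empty value is treated as a device list.
classify_visible_devices() {
  case "$1" in
    all|ALL)   echo "unrestricted" ;;
    "")        echo "unset" ;;
    none|void) echo "no-gpu" ;;
    *)         echo "restricted" ;;
  esac
}

# Usage against a running pod (my-gpu-pod is a placeholder name):
#   value=$(kubectl exec my-gpu-pod -- printenv NVIDIA_VISIBLE_DEVICES)
#   classify_visible_devices "$value"
#   kubectl exec my-gpu-pod -- nvidia-smi -L   # lists the GPUs the container can see
```

If the helper prints "unrestricted" while the Pod requested a single GPU, the symptom matches the one reported in this issue.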

My environment:

Koordinator version: 1.5.0
Kubernetes version (use kubectl version): v1.26
containerd: 1.7.20
OS: Ubuntu 22.04.4 LTS
GPU: one GPU server with 4 x P40 - Linux k8s-node1 6.5.0-41-generic proxytime: 1.4.0

saintube commented 2 months ago

@ferris-cx It looks like either the runtime proxy or the device scheduling is not working properly. Please share the pod YAML after the pod requests koordinator.sh/gpu-core and is scheduled to a node, so that we have more clues.

saintube commented 2 months ago

@ferris-cx As discussed offline, please describe your use cases for RDMAs and GPUs under #2181 if you have further device management needs.

ferris-cx commented 2 months ago

@saintube The problem is resolved. The cause was that the GPUEnvInject feature was not enabled. After the fix, kc get pod koordlet-2mttc -n koordinator-system -oyaml shows the following:

containers:
- args:
  - -addr=:9316
  - -cgroup-root-dir=/host-cgroup/
  - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true,GPUEnvInject=true
  - -runtime-hooks-host-endpoint=/var/run/koordlet/koordlet.sock
  - --logtostderr=true
  - --v=5

The GPUEnvInject feature gate is disabled by default. I added GPUEnvInject=true to the feature-gates argument manually, and the problem is fixed.
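The manual change described above amounts to adding one entry to the comma-separated feature-gates value. As a hedged sketch only, the hypothetical helper enable_feature_gate below builds the new value from the old one; how the value is then written back is an assumption, since the thread does not show which object was edited.

```shell
#!/bin/sh
# Hypothetical helper: return a feature-gates value with NAME=true set.
# An existing NAME=... entry is overwritten in place; otherwise NAME=true
# is appended to the end of the comma-separated list.
enable_feature_gate() {
  gates=$1
  name=$2
  out=""
  found=0
  old_ifs=$IFS
  IFS=,
  for gate in $gates; do
    case "$gate" in
      "$name="*) gate="$name=true"; found=1 ;;
    esac
    out="${out:+$out,}$gate"
  done
  IFS=$old_ifs
  [ "$found" -eq 1 ] || out="${out:+$out,}$name=true"
  echo "$out"
}

# Usage:
#   enable_feature_gate "BECPUEvict=true,Accelerators=true" GPUEnvInject
# The result would then be placed in the koordlet container's -feature-gates
# argument, for example with (assuming koordlet runs as a DaemonSet named
# koordlet in the koordinator-system namespace):
#   kubectl -n koordinator-system edit daemonset koordlet
```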

saintube commented 2 months ago

OK. This issue is closed. Please feel free to re-open it if you run into any other problems.

/close

koordinator-bot[bot] commented 2 months ago

@saintube: Closing this issue.

In response to [this comment](https://github.com/koordinator-sh/koordinator/issues/2185#issuecomment-2324679055).

koordinator-sh / koordinator

[question] GPU visibility inside the pod does not take effect #2185
