koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0
1.36k stars, 331 forks

[question] GPU visibility inside the pod does not take effect #2185

Closed ferris-cx closed 2 months ago

ferris-cx commented 2 months ago

I created a Pod that requests a GPU, entered the container with kubectl exec, and ran nvidia-smi. I can see device information for all four GPUs, but I should only see the one GPU that has been assigned to this Pod. GPU visibility is not being restricted. The cause appears to be the environment variable inside the container: it is set to NVIDIA_VISIBLE_DEVICES=all, but it should be set to the UUID of the assigned GPU. How do I solve this problem?

My environment:
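For readers who want to reproduce the check described above, here is a minimal sketch. It assumes the standard NVIDIA container runtime variable name NVIDIA_VISIBLE_DEVICES; the helper function name and the pod name my-gpu-pod are hypothetical and do not come from this thread.

```shell
#!/bin/sh
# Print the value of NVIDIA_VISIBLE_DEVICES from `env` output read on stdin.
# Prints nothing and returns 1 if the variable is not present.
# (visible_devices_from_env is a hypothetical helper name.)
visible_devices_from_env() {
    awk -F= '
        $1 == "NVIDIA_VISIBLE_DEVICES" { sub(/^[^=]*=/, ""); print; found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Hypothetical usage against a live cluster (my-gpu-pod is a placeholder):
#   kubectl exec my-gpu-pod -- env | visible_devices_from_env
#   kubectl exec my-gpu-pod -- nvidia-smi -L
```

If the helper prints all, the container can see every GPU on the node. A single GPU UUID would indicate that visibility is restricted to the assigned device.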

saintube commented 2 months ago

@ferris-cx It looks like either the runtime proxy or the device scheduling is not working properly. For more clues, please show us the pod YAML after the pod requests koordinator.sh/gpu-core and is scheduled to a node.

saintube commented 2 months ago

@ferris-cx As discussed offline, please describe your use cases for RDMAs and GPUs under #2181 if you have further device management needs.

ferris-cx commented 2 months ago

@saintube The problem has been resolved. The cause was that the GPUEnvInject feature gate was not enabled. The output of kc get pod koordlet-2mttc -n koordinator-system -o yaml is as follows:

containers:

saintube commented 2 months ago

> @saintube The problem has been resolved. The cause was that the GPUEnvInject feature gate was not enabled. The output of kc get pod koordlet-2mttc -n koordinator-system -o yaml is as follows:
>
> containers:
>
>   - args:
>     - -addr=:9316
>     - -cgroup-root-dir=/host-cgroup/
>     - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true,GPUEnvInject=true
>     - -runtime-hooks-host-endpoint=/var/run/koordlet/koordlet.sock
>     - --logtostderr=true
>     - --v=5
>
> The GPUEnvInject feature gate is disabled by default. I added GPUEnvInject=true manually, and the problem is fixed.

OK. This issue is closed. Please feel free to re-open it if you run into any other problems. /close
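The fix reported in this thread comes down to whether GPUEnvInject=true appears in the koordlet -feature-gates argument. A minimal sketch of that check follows, assuming the gates are given as a comma-separated Name=bool list as in the args above; the function name feature_gate_enabled is hypothetical.

```shell
#!/bin/sh
# Return 0 if the comma-separated feature-gate list ($1) sets gate ($2) to true.
# (feature_gate_enabled is a hypothetical helper name.)
feature_gate_enabled() {
    gates=$1
    name=$2
    # Wrap the list in commas so only whole entries can match.
    case ",$gates," in
        *",$name=true,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Feature-gate string copied from the koordlet args quoted in this thread.
gates='BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true,GPUEnvInject=true'

if feature_gate_enabled "$gates" GPUEnvInject; then
    echo "GPUEnvInject is enabled"
else
    echo "GPUEnvInject is not enabled"
fi
```

On a live cluster, the gate string could be taken from the koordlet pod spec, for example by piping the output of kubectl get pod koordlet-2mttc -n koordinator-system -o yaml through grep feature-gates.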

koordinator-bot[bot] commented 2 months ago

@saintube: Closing this issue.
