@ferris-cx It seems to be a problem where the runtime proxy or the device scheduling does not work as expected.
Please share the YAML of a pod that requests koordinator.sh/gpu-core
and has been scheduled onto a node, so we have more clues.
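For reference, a minimal sketch of such a pod spec is below; the pod name, image, resource quantities, and the schedulerName are illustrative assumptions (the scheduler name may differ depending on how koord-scheduler is deployed), not details taken from this issue:

# Hypothetical pod requesting one full GPU via Koordinator's extended resources;
# name, image, and quantities are examples only.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  schedulerName: koord-scheduler            # let the Koordinator scheduler place the pod
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        koordinator.sh/gpu-core: "100"           # 100% of one GPU's compute
        koordinator.sh/gpu-memory-ratio: "100"   # 100% of one GPU's memory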
@ferris-cx As discussed offline, please describe your use cases for RDMAs and GPUs under #2181 if you have further device management needs.
@saintube The problem has been resolved; the cause was that the GPUEnvInject feature was not enabled. After the fix, kc get pod koordlet-2mttc -n koordinator-system -o yaml shows the following:
containers:
- args:
  - -addr=:9316
  - -cgroup-root-dir=/host-cgroup/
  - -feature-gates=BECPUEvict=true,BEMemoryEvict=true,CgroupReconcile=true,Accelerators=true,GPUEnvInject=true
  - -runtime-hooks-host-endpoint=/var/run/koordlet/koordlet.sock
  - --logtostderr=true
  - --v=5
The GPUEnvInject feature gate is disabled by default. I just added it manually, and the problem is fixed.
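For anyone else hitting this, a rough sketch of enabling and verifying the gate, assuming the default DaemonSet name and namespace from the Helm install (adjust to your deployment); <gpu-pod> is a placeholder:

# Append GPUEnvInject=true to the -feature-gates argument of the koordlet container.
kubectl -n koordinator-system edit daemonset koordlet
# After the koordlet pods restart and the GPU pod is recreated, the variable should
# hold only the assigned device's UUID instead of ALL.
kubectl exec -it <gpu-pod> -- env | grep NVIDIA_VISIBLE_DEVICES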
OK. This issue is closed. Please feel free to reopen it if you run into any other problems. /close
@saintube: Closing this issue.
I create a Pod that requests a GPU, enter the container through kubectl exec, and execute nvidia-smi: I can see the information of all 4 GPU devices, but I should only see the single GPU device that has been assigned to this Pod. GPU visibility is not limited; the reason should be the container's environment variable NVIDIA_VISIBLE_DEVICES=ALL, which should instead be NVIDIA_VISIBLE_DEVICES=<GPU UUID>. How do I solve this problem?
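A short sketch of the commands that show the problem, with a placeholder pod name:

# <gpu-pod> is a placeholder for the pod created above.
kubectl exec -it <gpu-pod> -- nvidia-smi                        # lists all 4 GPUs instead of the 1 assigned
kubectl exec -it <gpu-pod> -- env | grep NVIDIA_VISIBLE_DEVICES # prints NVIDIA_VISIBLE_DEVICES=ALL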
My Environment: