Open guofengtd opened 2 years ago
Looks like https://github.com/tkestack/vcuda-controller/blob/master/src/loader.c#L943 can't parse containerd's cgroup.
@paragor So gpu-manager can't support containerd?
The nvidia hook (nvidia-container-runtime) is disabled, right?
I can prepare an MR to fix vanilla containerd.
@paragor I can create a GPU pod with gpu-manager, but when I run nvidia-smi in the GPU pod, I get the following errors:
F0627 09:09:43.017690 18 client.go:78] fail to get response from manager, error rpc error: code = Unknown desc = can't find containerd-918d8cef52c37c77df775137619b3ccffcf626cdd0d57d71a82d6113a6df5dcd from docker
/tmp/cuda-control/src/register.c:87 rpc client exit with 255
Do you know how to solve this problem?
@guofengtd My cgroup driver is systemd, the runtime is containerd, and the Kubernetes version is 1.23.
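For readers following along, here is a minimal sketch of why the runtime and cgroup driver matter when a container ID is read out of a cgroup path. It is not the code in loader.c or the fix in the linked branch. The helper name `extract_container_id` is hypothetical, and the path layouts in the comments are assumptions about typical setups, not values quoted in this thread.

```python
import re

# Hypothetical helper, not code from gpu-manager or vcuda-controller.
# Assumed cgroup leaf names (typical layouts, not quoted from this thread):
#   systemd driver + containerd: cri-containerd-<id>.scope
#   systemd driver + docker:     docker-<id>.scope
#   cgroupfs driver:             <id>
# A container ID is taken to be 64 lowercase hex characters.
_CONTAINER_ID = re.compile(r"([0-9a-f]{64})(?:\.scope)?$")


def extract_container_id(cgroup_path):
    """Return the container ID found in the last path component, or None."""
    # Only the last component names the container; parents name the pod and QoS class.
    leaf = cgroup_path.rstrip("/").rsplit("/", 1)[-1]
    match = _CONTAINER_ID.search(leaf)
    return match.group(1) if match else None
```

Matching on the ID itself, instead of on a fixed prefix such as `docker-`, is one way to avoid treating a runtime prefix as part of the ID, which is what the `containerd-918d...` name in the error above may suggest is happening.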
@seanchen022 Please try this fix: https://github.com/tkestack/gpu-manager/compare/master...paragor:master
I will only have a GPU cluster available for tests in 4 days :(
@seanchen022 I had the same problem as you a few days ago.
When I updated the gpu-manager image/code to a newer version, it was solved.
Maybe the master branch at commit f0669de works.
@paragor By the way, have you tried it your way?
Hello guys,
When I create a deployment with one GPU card, everything goes well,
but if it requests less than one GPU, it fails.
Below are the logs and node info. Please kindly review them and offer any suggestions or solutions. Thanks.
deployment log
node info
GPU-Manager log