tkestack / vcuda-controller

Problems caused by launching multiple pods at the same time #25

Open pidb opened 2 years ago

pidb commented 2 years ago

Why do I get an error when I start multiple GPU-resource pods simultaneously (concurrently) using vcuda?

In vcuda's loader.c, I added ferror to print the errno-related error message, and this is what I got:

image
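For anyone reproducing this, a diagnostic of that kind could look roughly like the sketch below. It is not the code from loader.c: the helper name `check_library_present` is hypothetical, and it only checks whether a given path exists and reports the errno text if it does not.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper (not part of vcuda): check that `path` exists.
 * Returns 0 on success, otherwise the errno value set by stat(),
 * after logging a human-readable reason to stderr.
 */
static int check_library_present(const char *path) {
    struct stat st;
    if (stat(path, &st) != 0) {
        int err = errno; /* save errno before any other library call can change it */
        fprintf(stderr, "cannot stat %s: %s (errno=%d)\n", path, strerror(err), err);
        return err;
    }
    return 0;
}
```

Calling it with the path where the container expects libcuda.so would show whether the file is simply missing (ENOENT) at the moment the process starts.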

But when I start the pods sequentially, I don't have this problem. So I guess it may be caused by a gap between the kubelet starting the container and gpu-manager placing the libcuda.so file.

pidb commented 2 years ago

cc @mYmNeo

rainfd commented 2 years ago

It seems that vcuda copies the library into the container after the container starts up. When multiple GPU-resource pods start simultaneously, this copy is not fast enough. You can try changing your command to: sh -c "sleep 5 && your command"
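A variation on the same idea is to poll for the library instead of sleeping for a fixed time. The sketch below assumes the problem really is that the file is not there yet. The function name `wait_for_file` and its parameters are hypothetical, and the real library path depends on how the container is set up.

```c
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper: wait until `path` exists, checking every
 * `interval_ms` milliseconds for at most `timeout_ms` milliseconds.
 * Returns 0 once the file is present, -1 if the timeout expires first.
 */
static int wait_for_file(const char *path, int timeout_ms, int interval_ms) {
    struct stat st;
    struct timespec pause;
    int waited_ms = 0;

    if (interval_ms <= 0) {
        interval_ms = 1; /* avoid a loop that never advances */
    }
    pause.tv_sec = interval_ms / 1000;
    pause.tv_nsec = (long)(interval_ms % 1000) * 1000000L;

    for (;;) {
        if (stat(path, &st) == 0) {
            return 0; /* file is there, safe to continue */
        }
        if (waited_ms >= timeout_ms) {
            return -1; /* gave up waiting */
        }
        nanosleep(&pause, NULL);
        waited_ms += interval_ms;
    }
}
```

This returns as soon as the file appears and fails explicitly on timeout, but it is still a workaround: it waits out the race rather than removing it.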

pidb commented 2 years ago

Oh, thanks rainfd. I knew about this workaround, but it feels like a hack.

mYmNeo commented 2 years ago

Which version of gpu-manager are you using? I've fixed a problem on the master branch, but haven't released an image yet.

rainfd commented 2 years ago

@mYmNeo my version is v1.0.4. What is the commit?

mYmNeo commented 2 years ago

https://github.com/tkestack/gpu-manager/pull/130