Open pidb opened 2 years ago
cc @mYmNeo
It looks as if vcuda copies the library into the container after the container has started. When multiple GPU-resource pods start simultaneously, this copy is not fast enough. You can try changing your command to:
sh -c "sleep 5 && your command"
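A fixed sleep only guesses how long the copy takes. As a rough sketch of the same workaround, the container could instead poll until the library file exists and give up after a timeout. The function name wait_for_file and the path in the usage comment are hypothetical; the real location of libcuda.so inside the container depends on how gpu-manager mounts it and is not confirmed in this thread.

```shell
#!/bin/sh
# wait_for_file PATH [TIMEOUT_SECONDS]
# Hypothetical helper: polls once per second until PATH exists.
# Returns 0 as soon as the file is present, 1 if the timeout elapses first.
wait_for_file() {
    path="$1"
    timeout="${2:-30}"   # default timeout of 30 seconds is an arbitrary choice
    waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for $path" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example use as a container entrypoint wrapper (the path is an assumption):
#   wait_for_file /path/to/libcuda.so 30 && exec your-command
```

This is still a workaround on the container side. It does not close the gap itself, it only waits it out without hard-coding the delay.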
Oh, thanks rainfd. I knew about this solution, but it feels like a hack to me.
Which version of gpu-manager are you using? I've fixed a problem on the master branch but haven't released an image yet.
@mYmNeo my version is v1.0.4. What is the commit?
Why do I get an error when I start multiple GPU-resource pods simultaneously (concurrently) using vcuda?
In vcuda's loader.c, I added ferror to print the errno-related error message, and I do get it. But when I start the pods sequentially, I don't have this problem. So I guess it may be caused by a gap between the kubelet starting the container and gpu-manager placing the libcuda.so file.