Open zasdfgbnm opened 2 years ago
@bureddy Can you take a look, please?
Hi @zasdfgbnm, the existing autotools code actually does check for the presence of that function at compile time, here: https://github.com/openucx/ucc/blob/e96a6de3def951748a8c1bd9f3d074f73c594f1f/config/m4/cuda.m4#L79. So I guess it was available at compile time and, in your case, it is not available at runtime. This implies a mismatch between the compile-time and runtime CUDA versions. Could you please check the environment and confirm?
We're seeing the undefined symbol message when we run a container that has CUDA 11.6 on a host with an older driver.
What is the driver version? Is it possible to choose the right CUDA toolkit version in the container?
https://docs.nvidia.com/deploy/cuda-compatibility/index.html
Otherwise, I think you need to have cuda-compat-11.6 in the container for compatibility.
The KMD was 460.73.01, UMD 510.47.03, and forward compat was used.
Unfortunately, it seems there is no forward compatibility for NVML (libnvidia-ml.so).
@bureddy what do you think about @zasdfgbnm's 2nd question?
Or at least detect the version and throw a kinder error message?
I am seeing this error:
Thanks to @crcrpar, who figured out that this is a new API: https://github.com/NVIDIA/nvidia-settings/blame/5b455b89bb73f56818c84444806bc9c928da67ac/src/nvml.h#L6009-L6026
For older versions of drivers, is it possible to use other APIs to achieve similar functionality? Or at least detect the version and throw a kinder error message?
cc: @ptrblck
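For illustration only, here is a minimal sketch of the "detect and throw a kinder error message" idea: look the symbol up at runtime before calling it, and report a readable message if the installed library does not export it. This is not how UCC handles it. The helper name `find_symbol` is hypothetical, and since the exact missing symbol is not quoted in this thread, the usage line probes `nvmlInit_v2` in `libnvidia-ml.so.1` purely as a stand-in.

```python
import ctypes


def find_symbol(library_name, symbol_name):
    """Hypothetical helper: check at runtime whether a shared library exports a symbol.

    Returns (found, message) instead of letting the dynamic loader abort
    with a bare "undefined symbol" error.
    """
    try:
        lib = ctypes.CDLL(library_name)
    except OSError as exc:
        # The library itself is missing or cannot be loaded.
        return False, f"could not load {library_name}: {exc}"

    # ctypes resolves symbols lazily; a missing symbol raises AttributeError,
    # which hasattr() turns into False.
    if not hasattr(lib, symbol_name):
        return False, (
            f"{library_name} does not export {symbol_name}; "
            "the installed driver may be older than the CUDA toolkit "
            "this binary was built against"
        )
    return True, f"{symbol_name} is available in {library_name}"


if __name__ == "__main__":
    # Stand-in symbol; substitute the one named in the actual error message.
    found, message = find_symbol("libnvidia-ml.so.1", "nvmlInit_v2")
    print(message)
```

The same pattern in C would use `dlopen` and `dlsym` and fall back to another code path, or fail with a descriptive error, when `dlsym` returns `NULL`.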