openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License

libucc_tl_cuda.so: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType #496

Open zasdfgbnm opened 2 years ago

zasdfgbnm commented 2 years ago

I am seeing this error:

libucc_tl_cuda.so: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType

Thanks to @crcrpar, who figured out that this is a new API: https://github.com/NVIDIA/nvidia-settings/blame/5b455b89bb73f56818c84444806bc9c928da67ac/src/nvml.h#L6009-L6026

For older versions of drivers, is it possible to use other APIs to achieve similar functionality? Or at least detect the version and throw a kinder error message?

cc: @ptrblck

jladd-mlnx commented 2 years ago

@bureddy Can you take a look, please?

vspetrov commented 2 years ago

Hi @zasdfgbnm, the existing autotools code actually does check for the presence of that function at compile time, here: https://github.com/openucx/ucc/blob/e96a6de3def951748a8c1bd9f3d074f73c594f1f/config/m4/cuda.m4#L79. So I guess it was available at compile time, and in your case it is not available at runtime. This implies a mismatch between the compile-time and runtime CUDA versions. Could you please check the environment and confirm?

crcrpar commented 2 years ago

We're seeing the undefined symbol message when we run a container that has CUDA 11.6 on a host with an older driver.

bureddy commented 2 years ago

What is the driver version? Is it possible to choose the right CUDA toolkit version in the container? https://docs.nvidia.com/deploy/cuda-compatibility/index.html Otherwise, I think you need to have cuda-compat-11.6 in the container for compatibility.

ptrblck commented 2 years ago

The KMD was 460.73.01, UMD 510.47.03, and forward compat was used.

bureddy commented 2 years ago

Unfortunately, it seems there is no forward compat for NVML (libnvidia-ml.so).

crcrpar commented 2 years ago

@bureddy what do you think about @zasdfgbnm's 2nd question?

> Or at least detect the version and throw a kinder error message?
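For reference, one way the "kinder error message" idea could look is a runtime probe for the symbol before anything depends on it. This is only a minimal sketch, not UCC code: the helper names `probe_symbol` and `check_nvml_nvlink_remote_type` are hypothetical, and it assumes the driver's NVML library is installed under its usual soname `libnvidia-ml.so.1`. It uses only the standard `dlopen`/`dlsym` calls.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if `symbol` resolves in shared library `lib`,
 * 0 if the library loads but does not export the symbol,
 * -1 if the library itself cannot be loaded. */
static int probe_symbol(const char *lib, const char *symbol)
{
    void *handle = dlopen(lib, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL) {
        return -1;
    }
    int found = (dlsym(handle, symbol) != NULL);
    dlclose(handle);
    return found;
}

/* Hypothetical check: returns 0 if the NVML function is available at runtime.
 * Otherwise returns -1 and writes a human-readable explanation into `msg`. */
static int check_nvml_nvlink_remote_type(char *msg, size_t len)
{
    const char *lib = "libnvidia-ml.so.1"; /* assumed NVML soname */
    const char *sym = "nvmlDeviceGetNvLinkRemoteDeviceType";
    int rc = probe_symbol(lib, sym);

    if (rc == 1) {
        if (len > 0) {
            msg[0] = '\0';
        }
        return 0;
    }
    if (rc == 0) {
        snprintf(msg, len,
                 "%s does not export %s; the installed NVIDIA driver is "
                 "probably older than the NVML headers used at build time",
                 lib, sym);
    } else {
        snprintf(msg, len, "cannot load %s; is the NVIDIA driver installed?",
                 lib);
    }
    return -1;
}
```

A caller would run `check_nvml_nvlink_remote_type(buf, sizeof(buf))` and print `buf` on failure. Note that a probe like this only helps if it runs before the dynamic loader tries to resolve the symbol, for example in whatever code loads `libucc_tl_cuda.so`, or if the CUDA component calls the function through a `dlsym` pointer instead of linking against it directly. On older glibc versions the program may also need to be linked with `-ldl`.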