openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Unexpected modprobe processes on RHEL9 CPU-only nodes using OpenMPI 5 with UCX built with CUDA #9997

Closed ZQyou closed 1 month ago

ZQyou commented 1 month ago

Describe the bug

I am not sure if this is a bug related to UCX, but I would like to understand more about it. I built OpenMPI 5 with UCX from HPC-X, where the UCX libraries were built with CUDA. When I ran any MPI application with OpenMPI on CPU nodes, I observed modprobe processes running simultaneously with the MPI executable, occupying the allocated CPUs for minutes. The modprobe processes were trying to load GPU modules. As a result, the actual job could only complete after the modprobe processes had finished. This issue occurs whenever MPI executables are launched and is observed only on RHEL9, not on our other clusters, which run RHEL7.

Steps to Reproduce

gleon99 commented 1 month ago

Hi @ZQyou

It does not appear to be a UCX issue. Please check the command line of these processes, what launches them (the parent process), and why they take so long to complete.

If after that you are convinced it has anything to do with UCX, let us know.
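
One way to collect the information requested above (command line and parent of each modprobe process) is to read it from `/proc` while the MPI job is running. The following is only a minimal sketch, assuming a Linux node with a standard `/proc` layout; the function names and the `proc_root` parameter are made up for illustration and are not part of UCX or OpenMPI.

```python
import os


def read_proc_file(proc_root, pid, name):
    """Return the text of <proc_root>/<pid>/<name>, or '' if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), name), "rb") as f:
            return f.read().decode(errors="replace")
    except OSError:
        # The process may have exited, or we may lack permission.
        return ""


def cmdline(proc_root, pid):
    """Return the argument list of a process (arguments are NUL-separated)."""
    raw = read_proc_file(proc_root, pid, "cmdline")
    return [arg for arg in raw.split("\0") if arg]


def parent_pid(proc_root, pid):
    """Return the parent PID from the 'PPid:' line of the status file, or None."""
    for line in read_proc_file(proc_root, pid, "status").splitlines():
        if line.startswith("PPid:"):
            return int(line.split()[1])
    return None


def find_processes(name, proc_root="/proc"):
    """List (pid, argv, ppid, parent argv) for processes whose argv[0] basename is `name`."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        pid = int(entry)
        argv = cmdline(proc_root, pid)
        if not argv or os.path.basename(argv[0]) != name:
            continue
        ppid = parent_pid(proc_root, pid)
        parent_argv = cmdline(proc_root, ppid) if ppid else []
        found.append((pid, argv, ppid, parent_argv))
    return sorted(found)


if __name__ == "__main__":
    for pid, argv, ppid, parent_argv in find_processes("modprobe"):
        print(f"pid={pid} cmd={' '.join(argv)!r} "
              f"ppid={ppid} parent_cmd={' '.join(parent_argv)!r}")
```

Run it on a compute node while the job is active. An empty parent command line usually indicates that the parent is a kernel thread rather than a user-space process such as the MPI executable.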