microsoft / msccl

Microsoft Collective Communication Library

Questions about MSCCL's building error #49

Closed tjdgh0715 closed 1 year ago

tjdgh0715 commented 1 year ago

Hello MSCCL team,

Thanks for the excellent work. I have some issues when building MSCCL in my own environment.

I am currently using Ubuntu 18.04 with GPUs connected via PCIe, not NVLink. I tried to build MSCCL on two machines: the first has 2 x V100 32GB GPUs, and the second has 2 x A5000 GPUs. Both machines use CUDA 11.1, which is set as the default CUDA path.

However, when I tried to build MSCCL following the guidelines in the official repo, the build froze with lots of warnings and then failed. (I tried building both from the source zip file and from a clone of the git repository, and neither succeeded.)

I tried to solve it by referring to the previous build error issues, but their fixes do not seem to work in my situation. I am also wondering whether MSCCL is compatible only with systems that have NVLink, but I am not sure about that.

I've attached the error logs from both attempts (building from the source zip file and from the cloned git repo). Could I get some advice on this issue?

error_msccl_git_build.log error_msccl_source_build.log

tjdgh0715 commented 1 year ago

It was due to an NCCL/CUDA version mismatch. I built a version of NCCL from source that is compatible with my CUDA version, and MSCCL now seems to work well.
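
For anyone hitting a similar mismatch, it may help to confirm which CUDA toolkit release the default `nvcc` reports before building. The following is only a minimal sketch and not part of MSCCL or NCCL: the helper names `parse_nvcc_release` and `default_cuda_release` are hypothetical, and it assumes that `nvcc --version` prints a line containing `release X.Y`.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) parsed from `nvcc --version` text, or None.

    Assumption: the output contains a fragment such as "release 11.1".
    """
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def default_cuda_release():
    """Ask the first nvcc on PATH for its release; None if unavailable."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    release = default_cuda_release()
    if release is None:
        print("nvcc not found on PATH (or its output was not recognized)")
    else:
        print("default CUDA toolkit release: %d.%d" % release)
```

The reported release can then be compared by hand against the CUDA versions supported by the NCCL or MSCCL sources being built; the sketch does not encode any compatibility table itself.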