microsoft / mscclpp

MSCCL++: A GPU-driven communication stack for scalable AI applications
MIT License

[Bug] libmscclpp_nccl fails linking using ROCm 6.0 #349

Closed by corey-derochie-amd 1 month ago

corey-derochie-amd commented 1 month ago

While commit 72b99a42291fcd6c5dcde694fcb3c5d72bc0c9c7 allows libmscclpp to compile using ROCm 6.0, there are still linker errors in libmscclpp_nccl:

ld.lld: error: duplicate symbol: __float2bfloat16(float)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float2bfloat16(float))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x0)

ld.lld: error: duplicate symbol: __bfloat1622float2(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__bfloat1622float2(__hip_bfloat162))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x40)

ld.lld: error: duplicate symbol: __double2bfloat16(double)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__double2bfloat16(double))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x60)

ld.lld: error: duplicate symbol: __float22bfloat162_rn(HIP_vector_type<float, 2u>)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float22bfloat162_rn(HIP_vector_type<float, 2u>))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0xA0)

ld.lld: error: duplicate symbol: __high2float(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__high2float(__hip_bfloat162))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x120)

ld.lld: error: duplicate symbol: __low2float(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__low2float(__hip_bfloat162))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x130)

ld.lld: error: duplicate symbol: __float2bfloat16(float)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float2bfloat16(float))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x0)

ld.lld: error: duplicate symbol: __bfloat1622float2(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__bfloat1622float2(__hip_bfloat162))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x40)

ld.lld: error: duplicate symbol: __double2bfloat16(double)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__double2bfloat16(double))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x60)

ld.lld: error: duplicate symbol: __float22bfloat162_rn(HIP_vector_type<float, 2u>)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float22bfloat162_rn(HIP_vector_type<float, 2u>))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0xA0)

ld.lld: error: duplicate symbol: __high2float(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__high2float(__hip_bfloat162))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x120)

ld.lld: error: duplicate symbol: __low2float(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__low2float(__hip_bfloat162))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x130)
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
gmake[5]: *** [apps/nccl/CMakeFiles/mscclpp_nccl.dir/build.make:145: apps/nccl/libmscclpp_nccl.so.0.5.2] Error 1
gmake[4]: *** [CMakeFiles/Makefile2:379: apps/nccl/CMakeFiles/mscclpp_nccl.dir/all] Error 2
gmake[4]: *** Waiting for unfinished jobs....
[100%] Built target check-format-cpp
gmake[3]: *** [Makefile:139: all] Error 2
gmake[2]: *** [CMakeFiles/mscclpp_nccl-download.dir/build.make:86: mscclpp_nccl-download-prefix/src/mscclpp_nccl-download-stamp/mscclpp_nccl-download-build] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/mscclpp_nccl-download.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2

This does not appear to be an issue with later versions of ROCm.

chhwang commented 1 month ago

Hi @corey-derochie-amd, the team has investigated this before, and it is very tricky to fix from the mscclpp side. Instead, we apply the following ROCm patch to include/hip/amd_detail/amd_hip_bf16.h to avoid this issue on ROCm 6.0:

97c97
< #define __HOST_DEVICE__ __device__
---
> #define __HOST_DEVICE__ __device__ static
100c100
< #define __HOST_DEVICE__ __host__ __device__
---
> #define __HOST_DEVICE__ __host__ __device__ static inline

This change has already been adopted in ROCm 6.1.
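For readers wondering why adding `static inline` makes the duplicate-symbol errors go away: a non-inline function defined in a header is emitted as an external symbol by every translation unit that includes it, so the linker sees several definitions of the same name. Marking it `static inline` gives each translation unit its own private copy with internal linkage. The sketch below is a plain host-side C++ illustration of that pattern, not the actual contents of amd_hip_bf16.h. The macro `SKETCH_HOST_DEVICE` and all `sketch_*` functions are hypothetical names, and the conversion simply truncates to the upper 16 bits, which may differ from the rounding ROCm uses.

```cpp
// Hypothetical header sketch; NOT the real amd_hip_bf16.h.
#ifndef BF16_SKETCH_HPP
#define BF16_SKETCH_HPP

#include <cassert>
#include <cstdint>
#include <cstring>

// Assumption: stand-in for the patched __HOST_DEVICE__ macro.
// Without "static inline", each .cc/.cu file including this header would
// export its own external definition and the link step would fail with
// "duplicate symbol". With it, every translation unit keeps a private copy.
#define SKETCH_HOST_DEVICE static inline

// Truncating float -> bfloat16 conversion: keep the top 16 bits of the float.
SKETCH_HOST_DEVICE std::uint16_t sketch_float2bf16_bits(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<std::uint16_t>(bits >> 16);
}

// bfloat16 bits -> float: place the 16 bits in the upper half of a 32-bit word.
SKETCH_HOST_DEVICE float sketch_bf16_bits2float(std::uint16_t h) {
  std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Small self-check that any including translation unit may call.
SKETCH_HOST_DEVICE void sketch_self_check() {
  assert(sketch_float2bf16_bits(1.0f) == 0x3F80);
  assert(sketch_bf16_bits2float(0x3F80) == 1.0f);
}

#endif  // BF16_SKETCH_HPP
```

Including this header from two separate source files and linking them together should succeed, whereas redefining `SKETCH_HOST_DEVICE` as empty would typically reproduce a duplicate-symbol error of the kind reported above.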

corey-derochie-amd commented 1 month ago

Thanks, @chhwang.