oneapi-src / oneCCL

oneAPI Collective Communications Library (oneCCL)
https://oneapi-src.github.io/oneCCL

Compile error on the master branch #54

Closed chengjunlu closed 2 years ago

chengjunlu commented 3 years ago

Env: Ubuntu 20.04 GCC-10

First error:

torch-ccl/third_party/oneCCL/src/atl/util/pm/pmi_resizable_rt/pmi_resizable_simple.h:124:17: error: field ‘my_proccess_name’ has incomplete type ‘std::string’ {aka ‘std::__cxx11::basic_string<char>’}
124 | std::string my_proccess_name;

After fixing this by adding "#include <string>" to pmi_resizable_simple.h, another error appeared:

torch-ccl/third_party/oneCCL/src/comp/bf16/bf16_intrisics.hpp:74:82: note: use ‘-flax-vector-conversions’ to permit conversions between vectors with differing element types or numbers of subparts
74 _mm256_storeu_si256((__m256i*)(dst), _mm512_cvtneps_pbh(_mm512_loadu_ps(src)));
torch-ccl/third_party/oneCCL/src/comp/bf16/bf16_intrisics.hpp:74:60: error: cannot convert ‘__m256bh’ to ‘__m256i’
74 _mm256_storeu_si256((__m256i*)(dst), _mm512_cvtneps_pbh(_mm512_loadu_ps(src)));
~~~~^~~~~~~~
__m256bh

Are there any suggestions?
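For the first error, "incomplete type" usually means that only a forward declaration of std::string was visible at that point, which can happen when a header stops pulling in <string> transitively. A minimal sketch of the fix follows; the class name `pmi_simple_sketch` and its accessor are hypothetical stand-ins for illustration, not the actual contents of pmi_resizable_simple.h.

```cpp
#include <string>  // provides the complete definition of std::string

// Hypothetical stand-in for the class declared in pmi_resizable_simple.h.
class pmi_simple_sketch {
public:
    explicit pmi_simple_sketch(const std::string& name)
        : my_proccess_name(name) {}

    const std::string& process_name() const { return my_proccess_name; }

private:
    // A data member needs the complete type, so <string> must be included
    // directly rather than relied on through other headers.
    // (The member name is spelled as it appears in the error message.)
    std::string my_proccess_name;
};
```

Constructing `pmi_simple_sketch p("rank0");` and calling `p.process_name()` returns the stored name; removing the include can reproduce the error if no other header happens to provide std::string.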

zhongyuansh commented 3 years ago

I have fixed the other error: "error: cannot convert ‘__m256bh’ to ‘__m256i’"

I added "-flax-vector-conversions" to CMAKE_CXX_FLAGS in the file torch-ccl/third_party/oneCCL/CMakeLists.txt:
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_COMPILER_FLAGS} -Wall -Werror -D_GNU_SOURCE -flax-vector-conversions -std=c++14 -fvisibility=internal")
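An alternative to relaxing the check for the whole build is an explicit cast at the call site, since GCC and Clang allow explicit casts between vector types of the same size. The sketch below illustrates the idea with generic vector-extension types so that it compiles without AVX-512 BF16 support; `vec16s`, `vec4ll`, `store_vec4ll`, and `store_vec16s` are hypothetical names standing in for `__m256bh`, `__m256i`, and `_mm256_storeu_si256`. In bf16_intrisics.hpp this would presumably correspond to casting the result of `_mm512_cvtneps_pbh` to `__m256i`, though this thread does not say how the upstream code was eventually fixed.

```cpp
#include <cstring>

// GCC/Clang vector-extension types, used as stand-ins for __m256bh
// (16 x 16-bit lanes) and __m256i (4 x 64-bit lanes). Both are 32 bytes.
typedef short     vec16s __attribute__((vector_size(32)));
typedef long long vec4ll __attribute__((vector_size(32)));

// Hypothetical store routine that, like _mm256_storeu_si256, accepts
// only the 64-bit-lane vector type.
inline void store_vec4ll(void* dst, const vec4ll& v) {
    std::memcpy(dst, &v, sizeof(v));
}

inline void store_vec16s(void* dst, const vec16s& v) {
    // store_vec4ll(dst, v);        // rejected unless -flax-vector-conversions is set
    const vec4ll bits = (vec4ll)v;  // explicit same-size cast: reinterprets the bits
    store_vec4ll(dst, bits);
}
```

The cast changes only the static type, not the stored bits, so the 16 lanes are written to memory unchanged. Keeping the conversion explicit leaves the compiler free to flag other, unintended vector conversions elsewhere in the build.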

But the master branch still has lots of compile errors, so I switched to another branch, "remotes/origin/ccl_torch1.7", which compiled successfully.

A snippet of the master branch compile errors follows; you can reproduce it in your environment (Ubuntu 20.04, Python 3.8, Anaconda, GCC 10):

In file included from /home/mark/work/pytorch/env-pytorch/torch-ccl-test/torch-ccl/src/ProcessGroupCCL.cpp:32:
/home/mark/work/pytorch/env-pytorch/torch-ccl-test/torch-ccl/src/ProcessGroupCCL.hpp:81:42: error: invalid covariant return type for ‘virtual c10::intrusive_ptr<c10d::ProcessGroup::Work> c10d::ProcessGroupCCL::broadcast(std::vector<at::Tensor>&, const c10d::BroadcastOptions&)’
81 | c10::intrusive_ptr<ProcessGroup::Work> broadcast(
| ^~~~~
In file included from /home/mark/work/pytorch/env-pytorch/torch-ccl-test/torch-ccl/src/ProcessGroupCCL.hpp:40,
from /home/mark/work/pytorch/env-pytorch/torch-ccl-test/torch-ccl/src/ProcessGroupCCL.cpp:32:
/home/mark/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include/c10d/ProcessGroup.hpp:134:47: note: overridden function is ‘virtual std::shared_ptr<c10d::ProcessGroup::Work> c10d::ProcessGroup::broadcast(std::vector<at::Tensor>&, const c10d::BroadcastOptions&)’
134 | virtual std::shared_ptr<ProcessGroup::Work> broadcast(
| ^~~~~
In file included from /home/mark/work/pytorch/env-pytorch/torch-ccl-test/torch-ccl/src/ProcessGroupCCL.cpp:32:
/home/mark/work/pytorch/env-pytorch/torch-ccl-test/torch-ccl/src/ProcessGroupCCL.hpp:85:42: error: invalid covariant return type for ‘virtual c10::intrusive_ptr<c10d::ProcessGroup::Work> c10d::ProcessGroupCCL::allreduce(std::vector<at::Tensor>&, const c10d::AllreduceOptions&)’
85 | c10::intrusive_ptr<ProcessGroup::Work> allreduce(
| ^~~~~
In file included from /home/mark/work/pytorch/env-pytorch/torch-ccl-test/torch-ccl/src/ProcessGroupCCL.hpp:40,
from /home/mark/work/pytorch/env-pytorch/torch-ccl-test/torch-ccl/src/ProcessGroupCCL.cpp:32:
/home/mark/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/include/c10d/ProcessGroup.hpp:138:47: note: overridden function is ‘virtual std::shared_ptr<c10d::ProcessGroup::Work> c10d::ProcessGroup::allreduce(std::vector<at::Tensor>&, const c10d::AllreduceOptions&)’
138 | virtual std::shared_ptr<ProcessGroup::Work> allreduce(
| ^~~~~
In file included from /home/mark/work/pytorch/env-pytorch/torch-ccl-test/torch-ccl/src/ProcessGroupCCL.cpp:32:
/home/mark/work/pytorch/env-pytorch/torch-ccl-test/torch-ccl/src/ProcessGroupCCL.hpp:89:42: error: invalid covariant return type for ‘virtual c10::intrusive_ptr<c10d::ProcessGroup::Work> c10d::ProcessGroupCCL::allreduce_coalesced(std::vector<at::Tensor>&, const c10d::AllreduceCoalescedOptions&)’
89 | c10::intrusive_ptr<ProcessGroup::Work> allreduce_coalesced(

zhongyuansh commented 3 years ago

BTW, I found this issue while building IPEX for PyTorch. Because of the torch_ccl compile error, IPEX could not be compiled. I then tried to build torch_ccl independently; the same issue also exists in the master branch but not in the ccl_torch1.7 branch. So I manually updated two files in IPEX, after which IPEX built successfully. The two files (mentioned before) are:
[ipex directory]/third_party/torch-ccl/third_party/oneCCL/CMakeLists.txt
[ipex directory]/third_party/torch-ccl/third_party/oneCCL/src/atl/util/pm/pmi_resizable_rt/pmi_resizable_simple.h

I checked the IPEX git info. IPEX's torch_ccl oneCCL submodule is checked out at the following commit, not the latest master branch: "Submodule path 'third_party/torch_ccl/third_party/oneCCL': checked out '751b1b0c00525aa685a4f3528435b2d0eb3c53a0'"

So maybe there are some bugs in the latest changes on the master branch.

chengjunlu commented 3 years ago

Thanks a lot for the information and effort to investigate the compile error of oneCCL.

It seems you have successfully built the oneCCL library. These errors occur because the installed torch version is not compatible with torch_ccl. The master branch of torch_ccl only supports the master branch of PyTorch on GitHub.

We have different torch_ccl branches for different PyTorch versions.

Please use master PyTorch together with the fixes you already have for oneCCL, or try the torch_ccl branch that matches your PyTorch release.
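The "invalid covariant return type" errors follow from a C++ rule: an override may narrow its return type only for raw pointers or references to related classes, and two different smart-pointer templates never qualify. So when the installed ProcessGroup.hpp declares std::shared_ptr returns and ProcessGroupCCL.hpp declares c10::intrusive_ptr returns, every override fails. The sketch below shows the rule with hypothetical stand-in classes (`Work`, `CclWork`, `GroupBase`, `GroupCcl`) rather than the real c10d types.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-ins for c10d::ProcessGroup::Work and a CCL-specific work type.
struct Work {
    virtual ~Work() = default;
    virtual std::string name() const { return "work"; }
};

struct CclWork : Work {
    std::string name() const override { return "ccl-work"; }
};

// Stand-in for the base class as declared by the installed headers.
struct GroupBase {
    virtual ~GroupBase() = default;
    virtual std::shared_ptr<Work> broadcast() { return std::make_shared<Work>(); }
};

struct GroupCcl : GroupBase {
    // Declaring a different wrapper type here (any other smart-pointer
    // template) would be rejected as an invalid covariant return type.
    // The return type must match the base declaration exactly.
    std::shared_ptr<Work> broadcast() override {
        return std::make_shared<CclWork>();  // converts to shared_ptr<Work>
    }
};
```

Calling `broadcast()` through a `GroupBase&` that refers to a `GroupCcl` yields an object whose `name()` is "ccl-work". In the real build, the matching signatures come from pairing the torch_ccl branch with the PyTorch version it was written against.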

zhongyuansh commented 3 years ago

ok, thanks. I will try it

mshiryaev commented 2 years ago

@zhongyuansh - do you still observe the compile issue with the latest CCL code?

mshiryaev commented 2 years ago

The two build issues mentioned above (related to std::string and the __m256bh/__m256i conversion) should be fixed in the latest CCL code on the master branch, so I am closing this ticket. Please re-open it or create a new ticket if new build issues appear.