BLD: compilation failure at comm_nccl.cu

tylerjereddy commented 1 week ago

On the LANL Venado machine (Linux ARM, Grace-Hopper architecture), the same compilation error arises with either the clang 18 (Cray clang version 18.0.0) or the gcc-13 (13.2.1) compiler toolchain, both with nvcc from CUDA 12.5, for a recently provided legate release. We only received a tarball, and the only version info I can find is `CMakeLists.txt:set(legate_version 24.09.00)`, but this may be a dev version of that and not a tagged release yet. If you direct me to the appropriate location to grep out an embedded git hash I'll go ahead and do that for you, but I don't have a git bundle, just a preview release tarball as far as I can tell.

Here are the steps I follow on Venado:

Environment setup and compilation commands

```bash
cd /lustre/vescratch1/treddy/custom_nvidia/legate
rm -rf arch-linux-cuda-release
eval "$(/lustre/vescratch1/treddy/tyler_conda/conda_scratch/bin/conda shell.bash hook)"
conda activate legate_custom
set +o errexit
set +e
module load PrgEnv-gnu/8.5.0
export CC=gcc-13
export CXX=g++-13
export CPATH=/opt/cray/libfabric/1.20.1/include:$CPATH
export LIBRARY_PATH=/opt/cray/libfabric/1.20.1/lib64:$LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/cray/libfabric/1.20.1/lib64:$LD_LIBRARY_PATH
module load cudatoolkit/24.7_12.5
module load cray-hdf5-parallel/1.14.3.1
export LD_LIBRARY_PATH=/opt/cray/pe/mpich/8.1.30/ofi/crayclang/17.0/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=/opt/cray/pe/mpich/8.1.30/ofi/crayclang/17.0/lib:$LIBRARY_PATH
export PATH=$PATH:/opt/cray/pe/cce/18.0.0/bin
export PATH=/opt/cray/libfabric/1.20.1/bin:$PATH
./configure --with-cuda --with-hdf5 --with-gasnet
export LEGATE_ARCH='arch-linux-cuda-release'
export LEGATE_DIR='/lustre/vescratch1/treddy/custom_nvidia/legate'
make -j 64
```

And here is the compilation failure (snipped at the end because the C++ compilation spam after the error is a bit much):

``` [212/308] Building CXX object _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_analysis.cc.o In file included from /usr/include/c++/13/bits/specfun.h:43, from /usr/include/c++/13/cmath:3699, from /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/legion-src/runtime/legion/legion_analysis.cc:16: In static member function ‘static _Up* std::__copy_move<_IsMove, true, std::random_access_iterator_tag>::__copy_m(_Tp*, _Tp*, _Up*) [with _Tp = Legion::Internal::CopyFillAggregator::CopyUpdate*; _Up = Legion::Internal::CopyFillAggregator::CopyUpdate*; bool _IsMove = false]’, inlined from ‘_OI std::__copy_move_a2(_II, _II, _OI) [with bool _IsMove = false; _II = Legion::Internal::CopyFillAggregator::CopyUpdate**; _OI = Legion::Internal::CopyFillAggregator::CopyUpdate**]’ at /usr/include/c++/13/bits/stl_algobase.h:506:30, inlined from ‘_OI std::__copy_move_a1(_II, _II, _OI) [with bool _IsMove = false; _II = Legion::Internal::CopyFillAggregator::CopyUpdate**; _OI = Legion::Internal::CopyFillAggregator::CopyUpdate**]’ at /usr/include/c++/13/bits/stl_algobase.h:533:42, inlined from ‘_OI std::__copy_move_a(_II, _II, _OI) [with bool _IsMove = false; _II = __gnu_cxx::__normal_iterator >; _OI = Legion::Internal::CopyFillAggregator::CopyUpdate**]’ at /usr/include/c++/13/bits/stl_algobase.h:540:31, inlined from ‘_OI std::copy(_II, _II, _OI) [with _II = __gnu_cxx::__normal_iterator >; _OI = Legion::Internal::CopyFillAggregator::CopyUpdate**]’ at /usr/include/c++/13/bits/stl_algobase.h:633:7, inlined from ‘static _ForwardIterator std::__uninitialized_copy::__uninit_copy(_InputIterator, _InputIterator, _ForwardIterator) [with _InputIterator = __gnu_cxx::__normal_iterator >; _ForwardIterator = Legion::Internal::CopyFillAggregator::CopyUpdate**]’ at /usr/include/c++/13/bits/stl_uninitialized.h:147:27, inlined from ‘_ForwardIterator std::uninitialized_copy(_InputIterator, _InputIterator, _ForwardIterator) [with _InputIterator = 
__gnu_cxx::__normal_iterator >; _ForwardIterator = Legion::Internal::CopyFillAggregator::CopyUpdate**]’ at /usr/include/c++/13/bits/stl_uninitialized.h:185:15, inlined from ‘_ForwardIterator std::__uninitialized_copy_a(_InputIterator, _InputIterator, _ForwardIterator, allocator<_Tp>&) [with _InputIterator = __gnu_cxx::__normal_iterator >; _ForwardIterator = Legion::Internal::CopyFillAggregator::CopyUpdate**; _Tp = Legion::Internal::CopyFillAggregator::CopyUpdate*]’ at /usr/include/c++/13/bits/stl_uninitialized.h:373:37, inlined from ‘void std::vector<_Tp, _Alloc>::_M_range_insert(iterator, _ForwardIterator, _ForwardIterator, std::forward_iterator_tag) [with _ForwardIterator = __gnu_cxx::__normal_iterator >; _Tp = Legion::Internal::CopyFillAggregator::CopyUpdate*; _Alloc = std::allocator]’ at /usr/include/c++/13/bits/vector.tcc:814:38, inlined from ‘std::vector<_Tp, _Alloc>::iterator std::vector<_Tp, _Alloc>::insert(const_iterator, _InputIterator, _InputIterator) [with _InputIterator = __gnu_cxx::__normal_iterator >; = void; _Tp = Legion::Internal::CopyFillAggregator::CopyUpdate*; _Alloc = std::allocator]’ at /usr/include/c++/13/bits/stl_vector.h:1483:19, inlined from ‘void Legion::Internal::CopyFillAggregator::issue_copies(Legion::Internal::InstanceView*, std::map >&, std::set&, Legion::Internal::ApEvent, const Legion::Internal::FieldMask&, const Legion::Internal::PhysicalTraceInfo&, bool, bool, std::vector*)’ at /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/legion-src/runtime/legion/legion_analysis.cc:7339:28: /usr/include/c++/13/bits/stl_algobase.h:437:30: warning: ‘void* __builtin_memmove(void*, const void*, long unsigned int)’ writing between 9 and 9223372036854775800 bytes into a region of size 0 overflows the destination [-Wstringop-overflow=] 437 | __builtin_memmove(__result, __first, sizeof(_Tp) * _Num); | ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In file included from 
/usr/include/c++/13/aarch64-suse-linux/bits/c++allocator.h:33, from /usr/include/c++/13/bits/allocator.h:46, from /usr/include/c++/13/bits/stl_tree.h:64, from /usr/include/c++/13/map:62, from /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/legion-src/runtime/legion/legion_types.h:30, from /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/legion-src/runtime/legion.h:56, from /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/legion-src/runtime/legion/legion_analysis.cc:17: In member function ‘_Tp* std::__new_allocator<_Tp>::allocate(size_type, const void*) [with _Tp = Legion::Internal::InstanceView*]’, inlined from ‘static _Tp* std::allocator_traits >::allocate(allocator_type&, size_type) [with _Tp = Legion::Internal::InstanceView*]’ at /usr/include/c++/13/bits/alloc_traits.h:482:28, inlined from ‘std::_Vector_base<_Tp, _Alloc>::pointer std::_Vector_base<_Tp, _Alloc>::_M_allocate(std::size_t) [with _Tp = Legion::Internal::InstanceView*; _Alloc = std::allocator]’ at /usr/include/c++/13/bits/stl_vector.h:378:33, inlined from ‘std::_Vector_base<_Tp, _Alloc>::pointer std::_Vector_base<_Tp, _Alloc>::_M_allocate(std::size_t) [with _Tp = Legion::Internal::CopyFillAggregator::CopyUpdate*; _Alloc = std::allocator]’ at /usr/include/c++/13/bits/stl_vector.h:375:7, inlined from ‘void std::vector<_Tp, _Alloc>::_M_range_insert(iterator, _ForwardIterator, _ForwardIterator, std::forward_iterator_tag) [with _ForwardIterator = __gnu_cxx::__normal_iterator >; _Tp = Legion::Internal::CopyFillAggregator::CopyUpdate*; _Alloc = std::allocator]’ at /usr/include/c++/13/bits/vector.tcc:805:40, inlined from ‘std::vector<_Tp, _Alloc>::iterator std::vector<_Tp, _Alloc>::insert(const_iterator, _InputIterator, _InputIterator) [with _InputIterator = __gnu_cxx::__normal_iterator >; = void; _Tp = Legion::Internal::CopyFillAggregator::CopyUpdate*; _Alloc = std::allocator]’ at 
/usr/include/c++/13/bits/stl_vector.h:1483:19, inlined from ‘void Legion::Internal::CopyFillAggregator::issue_copies(Legion::Internal::InstanceView*, std::map >&, std::set&, Legion::Internal::ApEvent, const Legion::Internal::FieldMask&, const Legion::Internal::PhysicalTraceInfo&, bool, bool, std::vector*)’ at /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/legion-src/runtime/legion/legion_analysis.cc:7339:28: /usr/include/c++/13/bits/new_allocator.h:151:55: note: at offset [-9223372036854775808, -1] into destination object of size [8, 9223372036854775800] allocated by ‘operator new’ 151 | return static_cast<_Tp*>(_GLIBCXX_OPERATOR_NEW(__n * sizeof(_Tp))); | ^ [296/308] Building CUDA object src/cpp/CMakeFiles/legate.dir/legate/comm/detail/comm_nccl.cu.o FAILED: src/cpp/CMakeFiles/legate.dir/legate/comm/detail/comm_nccl.cu.o /opt/nvidia/hpc_sdk/Linux_aarch64/24.7/cuda/12.5/bin/nvcc -forward-unknown-to-host-compiler -DFMT_SHARED -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DUSE_CUDA -DUSE_HDF -Dlegate_EXPORTS -I/lustre/vescratch1/treddy/custom_nvidia/legate/src/cpp -I/lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/src/cpp/include/legate -I/lustre/vescratch1/treddy/custom_nvidia/legate/share/legate/mpi_wrapper/src -isystem /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/cccl-src/thrust/thrust/cmake/../.. -isystem /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/cccl-src/libcudacxx/lib/cmake/libcudacxx/../../../include -isystem /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/cccl-src/cub/cub/cmake/../.. 
-isystem /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/legion-src/runtime -isystem /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/legion-src/runtime/mappers -isystem /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/legion-build/runtime -isystem /opt/nvidia/hpc_sdk/Linux_aarch64/24.7/cuda/12.5/targets/sbsa-linux/include -isystem /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/mdspan-src/include -isystem /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/span-src/include -isystem /lustre/vescratch1/treddy/tyler_conda/conda_scratch/envs/legate_custom/include -isystem /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/fmt-src/include -isystem /lustre/vescratch1/treddy/custom_nvidia/legate/arch-linux-cuda-release/cmake_build/_deps/argparse-src/include --compiler-options=-O3 -O2 -std=c++17 -arch=all-major -Xcompiler=-fPIC -Xfatbin=-compress-all --expt-extended-lambda --expt-relaxed-constexpr -Wno-deprecated-gpu-targets -MD -MT src/cpp/CMakeFiles/legate.dir/legate/comm/detail/comm_nccl.cu.o -MF src/cpp/CMakeFiles/legate.dir/legate/comm/detail/comm_nccl.cu.o.d -x cu -c /lustre/vescratch1/treddy/custom_nvidia/legate/src/cpp/legate/comm/detail/comm_nccl.cu -o src/cpp/CMakeFiles/legate.dir/legate/comm/detail/comm_nccl.cu.o /lustre/vescratch1/treddy/custom_nvidia/legate/src/cpp/legate/task/variant_helper.h: In instantiation of ‘static void legate::detail::VariantHelper::record(const legate::Library&, legate::TaskInfo*, const std::map&) [with T = legate::detail::comm::nccl::InitId; SELECTOR = legate::detail::GPUVariant]’: /lustre/vescratch1/treddy/custom_nvidia/legate/src/cpp/legate/task/task.inl:55:64: required from ‘static std::unique_ptr legate::LegateTask::create_task_info_(const legate::Library&, const std::map&) [with T = 
legate::detail::comm::nccl::InitId]’
/lustre/vescratch1/treddy/custom_nvidia/legate/src/cpp/legate/task/task.inl:44:37: required from ‘static void legate::LegateTask::register_variants(legate::Library, legate::LocalTaskID, const std::map&) [with T = legate::detail::comm::nccl::InitId]’
/lustre/vescratch1/treddy/custom_nvidia/legate/src/cpp/legate/task/task.inl:37:18: required from ‘static void legate::LegateTask::register_variants(legate::Library, const std::map&) [with T = legate::detail::comm::nccl::InitId]’
/lustre/vescratch1/treddy/custom_nvidia/legate/src/cpp/legate/comm/detail/comm_nccl.cu:277:56: required from here
/lustre/vescratch1/treddy/custom_nvidia/legate/src/cpp/legate/task/variant_helper.h:133:16: error: unable to deduce ‘const auto’ from ‘task_wrapper_<std::invoke_result_t<ncclUniqueId (* const)(const Legion::Task*, const std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> >&, Legion::Internal::TaskContext*, Legion::Runtime*), const Legion::Task*, const std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> >&, Legion::Internal::TaskContext*, Legion::Runtime*>, variant_impl, variant_kind>’
       constexpr auto entry = T::BASE::template task_wrapper_<RET, variant_impl, variant_kind>;
```

lightsighter commented 6 days ago

You can ignore the warning for the legion_analysis.cc translation unit. It is a bug with the -Wstringop-overflow static analysis which is present in many compilers. You can read more about it here.
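If the warning noise gets in the way, one option is to silence it locally. The sketch below is only an illustration, not Legion code: `append_all` is a hypothetical helper that repeats the `std::vector::insert` range call seen in the trace and wraps it in GCC diagnostic pragmas. Because the diagnostic is emitted late, after inlining, a local pragma may not always take effect, so treat this as something to try rather than a guaranteed fix.

```cpp
#include <vector>

// Hypothetical helper (not part of Legion): appends every pointer in src to dst,
// using the same range-insert call that triggers the false positive in the trace.
inline void append_all(std::vector<int*>& dst, const std::vector<int*>& src) {
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wstringop-overflow"
#endif
  dst.insert(dst.end(), src.begin(), src.end());
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC diagnostic pop
#endif
}
```

The guard keeps the pragmas away from clang, which does not know this warning name.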

The real problem is this:

/lustre/vescratch1/treddy/custom_nvidia/legate/src/cpp/legate/task/variant_helper.h:133:16: error: unable to deduce ‘const auto’ from ‘task_wrapper_<std::invoke_result_t<ncclUniqueId (* const)(const Legion::Task*, const std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> >&, Legion::Internal::TaskContext*, Legion::Runtime*), const Legion::Task*, const std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> >&, Legion::Internal::TaskContext*, Legion::Runtime*>, variant_impl, variant_kind>’
       constexpr auto entry = T::BASE::template task_wrapper_<RET, variant_impl, variant_kind>;

manopapad commented 4 days ago

@tylerjereddy could you please try replacing `constexpr auto entry` with `constexpr Processor::TaskFuncPtr entry`?
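For context, here is a minimal sketch of one situation in which GCC reports "unable to deduce 'auto'" for a function template name, and how an explicit function-pointer type avoids it. Everything in it is hypothetical: `wrapper` and the simplified `TaskFuncPtr` alias are stand-ins, and whether overloading is the actual trigger in `variant_helper.h` has not been established in this thread.

```cpp
// Hypothetical stand-in for a function-pointer alias such as Processor::TaskFuncPtr;
// the real alias has a different signature.
using TaskFuncPtr = int (*)(int);

// Two function templates share one name, so wrapper<char> names an overload set.
template <typename T>
constexpr int wrapper(int x) { return x + static_cast<int>(sizeof(T)); }

template <typename T>
constexpr int wrapper(int x, int y) { return x + y; }

// constexpr auto entry = wrapper<char>;   // rejected: auto gives no target type to pick an overload
constexpr TaskFuncPtr entry = wrapper<char>;  // accepted: the pointer type selects wrapper<char>(int)

static_assert(entry(1) == 2, "the one-argument overload was selected");

// Small caller showing the pointer is usable at run time as well.
int call_entry(int x) { return entry(x); }
```

With the explicit pointer type, the compiler resolves the name against the declared signature instead of trying to deduce a type from an ambiguous expression.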

tylerjereddy commented 3 days ago

Will do. Venado is down for another day or two, I think (this time for a dedicated activity time/reservation).

marcinz commented 1 day ago

@tylerjereddy Does the compiler provide any notes after the error?

tylerjereddy commented 1 day ago

A few thousand lines of C++ spam follow the error IIRC (sorry C++ devs..), but I can share the full log once Venado comes back up if you want.

nv-legate / legate.core

BLD: compilation failure at comm_nccl.cu #959