Open CharlelieLrt opened 8 months ago
Can you add the following flags to your command line: -ll:force_kthreads -lg:inorder -lg:safe_ctrlrepl 1, and then attach a debugger to the process on each node and report the results of thread apply all bt from each node?
If possible, build the Legate core with --debug before doing that so we can get line numbers for the backtraces.
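One way to collect these backtraces without an interactive session is sketched below. It is a minimal sketch, assuming gdb is installed on each node and the PID of the Legate process is already known; the helper name backtrace_command and the PID 12345 are hypothetical and not part of Legate.

```python
import shlex


def backtrace_command(pid: int) -> list[str]:
    """Build a gdb invocation that prints a backtrace for every thread of `pid`.

    -p attaches to the running process, -batch exits after the commands run,
    and -ex executes the given gdb command.
    """
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


# Print the command for a hypothetical PID so it can be pasted into a shell
# on each node (redirect the output to a file to attach it to the issue).
print(shlex.join(backtrace_command(12345)))
```

Running the printed command on every node and saving each output to its own file gives one backtrace file per node.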
Trying to build Legate core with --debug gives me the error below. It builds without problems without the debug option.
FAILED: _deps/legion-build/lib/liblegion.so.1
: && /usr/tce/packages/gcc/gcc-8.3.1/bin/c++ -fPIC -mcpu=native -maltivec -mabi=altivec -mvsx -O0 -g -shared -Wl,-soname,liblegion.so.1 -o _deps/legion-build/lib/liblegion.so.1 _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/mappers/default_mapper.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/mappers/mapping_utilities.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/mappers/shim_mapper.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/mappers/test_mapper.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/mappers/null_mapper.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/mappers/replay_mapper.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/mappers/debug_mapper.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/mappers/wrapper_mapper.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/mappers/forwarding_mapper.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/mappers/logging_wrapper.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/garbage_collection.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/index_space_value.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_analysis.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_c.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_constraint.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_context.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_instances.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_mapping.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_ops.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_profiling.cc.o 
_deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_profiling_serializer.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_replication.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_spy.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_tasks.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_trace.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_views.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_redop.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/mapper_manager.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/runtime.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/legion_redop.cu.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_1.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_1_1.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_1_2.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_1_3.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_1_4.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_1_5.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_2.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_2_1.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_2_2.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_2_3.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_2_4.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_2_5.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_3.cc.o 
_deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_3_1.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_3_2.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_3_3.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_3_4.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_3_5.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_4.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_4_1.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_4_2.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_4_3.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_4_4.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_4_5.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_1.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_2.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_3.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_4.cc.o _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o -L/usr/tce/packages/cuda/cuda-12.0.0/nvidia/lib64 -L/usr/tce/packages/cuda/cuda-12.0.0/nvidia/targets/ppc64le-linux/lib/stubs -L/usr/tce/packages/cuda/cuda-12.0.0/nvidia/targets/ppc64le-linux/lib 
-Wl,-rpath,"\$ORIGIN:/g/g92/laurent3/miniforge3/envs/legate_01302024_DEBUG/lib:/usr/tce/packages/cuda/cuda-12.0.0/nvidia/lib64:/usr/WS1/laurent3/Codes/LEGATE/legate_01302024_DEBUG.core/_skbuild/linux-ppc64le-3.10/cmake-build/_deps/legion-build/lib:/usr/tce/packages/cuda/cuda-12.0.0/lib64:/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib:/usr/tce/packages/cuda/cuda-12.0.0/nvidia/targets/ppc64le-linux/lib:" _deps/legion-build/lib/librealm.so.1 /g/g92/laurent3/miniforge3/envs/legate_01302024_DEBUG/lib/libz.so _deps/legion-build/embed-gasnet/install/lib/libgasnet-ibv-par.a _deps/legion-build/embed-gasnet/install/lib/libgasnet-ibv-par.a /usr/lib64/libibverbs.so /usr/lib64/libhwloc.so /usr/tce/packages/cuda/cuda-12.0.0/lib64/libcuda.so -lpthread /usr/lib64/librt.so /usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8/libgcc.a /usr/lib64/libm.so /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpiprofilesupport.so /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so /g/g92/laurent3/miniforge3/envs/legate_01302024_DEBUG/lib/libcudart.so /usr/tce/packages/cuda/cuda-12.0.0/nvidia/targets/ppc64le-linux/lib/libcuda.so -lcudadevrt -lcudart && :
_deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o: In function `Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >* std::__uninitialized_move_if_noexcept_a<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >*, Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >*, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> > > >(Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >*, Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >*, Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >*, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> > >&)':
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_uninitialized.h:311:(.text._ZSt34__uninitialized_move_if_noexcept_aIPN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EjEENS0_4RectILi5ExEEEES7_SaIS6_EET0_T_SA_S9_RT1_[_ZSt34__uninitialized_move_if_noexcept_aIPN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EjEENS0_4RectILi5ExEEEES7_SaIS6_EET0_T_SA_S9_RT1_]+0x34): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `std::move_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >*> std::__make_move_if_noexcept_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >, std::move_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >*> >(Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >*)' defined in .text._ZSt32__make_move_if_noexcept_iteratorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EjEENS0_4RectILi5ExEEEESt13move_iteratorIPS6_EET0_PT_[_ZSt32__make_move_if_noexcept_iteratorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EjEENS0_4RectILi5ExEEEESt13move_iteratorIPS6_EET0_PT_] section in _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_uninitialized.h:311:(.text._ZSt34__uninitialized_move_if_noexcept_aIPN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EjEENS0_4RectILi5ExEEEES7_SaIS6_EET0_T_SA_S9_RT1_[_ZSt34__uninitialized_move_if_noexcept_aIPN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EjEENS0_4RectILi5ExEEEES7_SaIS6_EET0_T_SA_S9_RT1_]+0x44): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `std::move_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >*> std::__make_move_if_noexcept_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >, std::move_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >*> >(Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Rect<5, long long> >*)' defined in .text._ZSt32__make_move_if_noexcept_iteratorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EjEENS0_4RectILi5ExEEEESt13move_iteratorIPS6_EET0_PT_[_ZSt32__make_move_if_noexcept_iteratorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EjEENS0_4RectILi5ExEEEESt13move_iteratorIPS6_EET0_PT_] section in _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o
_deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o: In function `std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, long long>, Realm::Point<5, unsigned int> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, long long>, Realm::Point<5, unsigned int> > > >::vector(std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, long long>, Realm::Point<5, unsigned int> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, long long>, Realm::Point<5, unsigned int> > > > const&)':
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_vector.h:460:(.text._ZNSt6vectorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5ExEENS0_5PointILi5EjEEEESaIS6_EEC2ERKS8_[_ZNSt6vectorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5ExEENS0_5PointILi5EjEEEESaIS6_EEC5ERKS8_]+0x38): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, long long>, Realm::Point<5, unsigned int> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, long long>, Realm::Point<5, unsigned int> > > >::size() const' defined in .text._ZNKSt6vectorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5ExEENS0_5PointILi5EjEEEESaIS6_EE4sizeEv[_ZNKSt6vectorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5ExEENS0_5PointILi5EjEEEESaIS6_EE4sizeEv] section in _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o
_deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o: In function `std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> > > >::begin() const':
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_vector.h:708:(.text._ZNKSt6vectorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EiEENS0_4RectILi5EjEEEESaIS6_EE5beginEv[_ZNKSt6vectorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EiEENS0_4RectILi5EjEEEESaIS6_EE5beginEv]+0x3c): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `__gnu_cxx::__normal_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> > const*, std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> > > > >::__normal_iterator(Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> > const* const&)' defined in .text._ZN9__gnu_cxx17__normal_iteratorIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_4RectILi5EjEEEESt6vectorIS7_SaIS7_EEEC2ERKS9_[_ZN9__gnu_cxx17__normal_iteratorIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_4RectILi5EjEEEESt6vectorIS7_SaIS7_EEEC5ERKS9_] section in _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o
_deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o: In function `std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> > > >::end() const':
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_vector.h:726:(.text._ZNKSt6vectorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EiEENS0_4RectILi5EjEEEESaIS6_EE3endEv[_ZNKSt6vectorIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EiEENS0_4RectILi5EjEEEESaIS6_EE3endEv]+0x3c): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `__gnu_cxx::__normal_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> > const*, std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> > > > >::__normal_iterator(Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, unsigned int> > const* const&)' defined in .text._ZN9__gnu_cxx17__normal_iteratorIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_4RectILi5EjEEEESt6vectorIS7_SaIS7_EEEC2ERKS9_[_ZN9__gnu_cxx17__normal_iteratorIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_4RectILi5EjEEEESt6vectorIS7_SaIS7_EEEC5ERKS9_] section in _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o
_deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o: In function `std::allocator_traits<std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, long long>, Realm::Rect<5, int> > > >::allocate(std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, long long>, Realm::Rect<5, int> > >&, unsigned long)':
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/alloc_traits.h:436:(.text._ZNSt16allocator_traitsISaIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5ExEENS0_4RectILi5EiEEEEEE8allocateERS7_m[_ZNSt16allocator_traitsISaIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5ExEENS0_4RectILi5EiEEEEEE8allocateERS7_m]+0x30): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `__gnu_cxx::new_allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, long long>, Realm::Rect<5, int> > >::allocate(unsigned long, void const*)' defined in .text._ZN9__gnu_cxx13new_allocatorIN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5ExEENS1_4RectILi5EiEEEEE8allocateEmPKv[_ZN9__gnu_cxx13new_allocatorIN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5ExEENS1_4RectILi5EiEEEEE8allocateEmPKv] section in _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o
_deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o: In function `__gnu_cxx::__alloc_traits<std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Point<5, unsigned int> > >, Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Point<5, unsigned int> > >::_S_select_on_copy(std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Point<5, unsigned int> > > const&)':
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/ext/alloc_traits.h:95:(.text._ZN9__gnu_cxx14__alloc_traitsISaIN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EjEENS1_5PointILi5EjEEEEES7_E17_S_select_on_copyERKS8_[_ZN9__gnu_cxx14__alloc_traitsISaIN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EjEENS1_5PointILi5EjEEEEES7_E17_S_select_on_copyERKS8_]+0x30): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `std::allocator_traits<std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Point<5, unsigned int> > > >::select_on_container_copy_construction(std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, unsigned int>, Realm::Point<5, unsigned int> > > const&)' defined in .text._ZNSt16allocator_traitsISaIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EjEENS0_5PointILi5EjEEEEEE37select_on_container_copy_constructionERKS7_[_ZNSt16allocator_traitsISaIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EjEENS0_5PointILi5EjEEEEEE37select_on_container_copy_constructionERKS7_] section in _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o
_deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o: In function `Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, int> >* std::__uninitialized_copy<false>::__uninit_copy<std::move_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, int> >*>, Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, int> >*>(std::move_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, int> >*>, std::move_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, int> >*>, Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, int> >*)':
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_uninitialized.h:83:(.text._ZNSt20__uninitialized_copyILb0EE13__uninit_copyISt13move_iteratorIPN5Realm19FieldDataDescriptorINS3_10IndexSpaceILi5EiEENS3_4RectILi5EiEEEEESA_EET0_T_SD_SC_[_ZNSt20__uninitialized_copyILb0EE13__uninit_copyISt13move_iteratorIPN5Realm19FieldDataDescriptorINS3_10IndexSpaceILi5EiEENS3_4RectILi5EiEEEEESA_EET0_T_SD_SC_]+0x84): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `void std::_Construct<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, int> >, Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, int> > >(Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, int> >*, Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Rect<5, int> >&&)' defined in .text._ZSt10_ConstructIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EiEENS0_4RectILi5EiEEEEJS6_EEvPT_DpOT0_[_ZSt10_ConstructIN5Realm19FieldDataDescriptorINS0_10IndexSpaceILi5EiEENS0_4RectILi5EiEEEEJS6_EEvPT_DpOT0_] section in _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o
_deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o: In function `Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> >* std::__uninitialized_copy<false>::__uninit_copy<__gnu_cxx::__normal_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> > const*, std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> > > > >, Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> >*>(__gnu_cxx::__normal_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> > const*, std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> > > > >, __gnu_cxx::__normal_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> > const*, std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> > > > >, Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> >*)':
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_uninitialized.h:83:(.text._ZNSt20__uninitialized_copyILb0EE13__uninit_copyIN9__gnu_cxx17__normal_iteratorIPKN5Realm19FieldDataDescriptorINS4_10IndexSpaceILi5EiEENS4_5PointILi5EiEEEESt6vectorISA_SaISA_EEEEPSA_EET0_T_SJ_SI_[_ZNSt20__uninitialized_copyILb0EE13__uninit_copyIN9__gnu_cxx17__normal_iteratorIPKN5Realm19FieldDataDescriptorINS4_10IndexSpaceILi5EiEENS4_5PointILi5EiEEEESt6vectorISA_SaISA_EEEEPSA_EET0_T_SJ_SI_]+0x70): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `__gnu_cxx::__normal_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> > const*, std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, int> > > > >::operator*() const' defined in .text._ZNK9__gnu_cxx17__normal_iteratorIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_5PointILi5EiEEEESt6vectorIS7_SaIS7_EEEdeEv[_ZNK9__gnu_cxx17__normal_iteratorIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_5PointILi5EiEEEESt6vectorIS7_SaIS7_EEEdeEv] section in _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o
_deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o: In function `bool __gnu_cxx::operator!=<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> > const*, std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> > > > >(__gnu_cxx::__normal_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> > const*, std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> > > > > const&, __gnu_cxx::__normal_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> > const*, std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> > > > > const&)':
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_iterator.h:887:(.text._ZN9__gnu_cxxneIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_5PointILi5ExEEEESt6vectorIS7_SaIS7_EEEEbRKNS_17__normal_iteratorIT_T0_EESI_[_ZN9__gnu_cxxneIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_5PointILi5ExEEEESt6vectorIS7_SaIS7_EEEEbRKNS_17__normal_iteratorIT_T0_EESI_]+0x2c): relocation truncated to fit: R_PPC64_REL24 (stub) against symbol `__gnu_cxx::__normal_iterator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> > const*, std::vector<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> >, std::allocator<Realm::FieldDataDescriptor<Realm::IndexSpace<5, int>, Realm::Point<5, long long> > > > >::base() const' defined in .text._ZNK9__gnu_cxx17__normal_iteratorIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_5PointILi5ExEEEESt6vectorIS7_SaIS7_EEE4baseEv[_ZNK9__gnu_cxx17__normal_iteratorIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_5PointILi5ExEEEESt6vectorIS7_SaIS7_EEE4baseEv] section in _deps/legion-build/runtime/CMakeFiles/LegionRuntime.dir/legion/region_tree_5_5.cc.o
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/stl_iterator.h:887:(.text._ZN9__gnu_cxxneIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_5PointILi5ExEEEESt6vectorIS7_SaIS7_EEEEbRKNS_17__normal_iteratorIT_T0_EESI_[_ZN9__gnu_cxxneIPKN5Realm19FieldDataDescriptorINS1_10IndexSpaceILi5EiEENS1_5PointILi5ExEEEESt6vectorIS7_SaIS7_EEEEbRKNS_17__normal_iteratorIT_T0_EESI_]+0x40): additional relocation overflows omitted from the output
That's mostly an issue with your linker trying to shoehorn something that needs more than 24 bits of address space into a tiny 24-bit address space. You can try adding this flag to your link flags: -mcmodel=large, or you can try doing a --debug-release build.
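For context on the numbers behind R_PPC64_REL24: a PowerPC relative branch stores a signed 24-bit offset counted in 4-byte instruction words, so the reachable byte displacement is a signed 26-bit value. The sketch below works out that range; the function name rel24_branch_range is hypothetical, and the link to the -O0 -g build being larger is a presumption rather than something confirmed in this thread.

```python
def rel24_branch_range(field_bits: int = 24, alignment_shift: int = 2) -> tuple[int, int]:
    """Return the (lowest, highest) byte displacement of a signed, word-aligned branch field.

    The field holds `field_bits` bits and is shifted left by `alignment_shift`
    because instructions are 4-byte aligned.
    """
    total_bits = field_bits + alignment_shift
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - (1 << alignment_shift)
    return lowest, highest


low, high = rel24_branch_range()
print(f"reach: {low} .. {high} bytes (about +/-{(1 << 25) // 2**20} MiB)")
```

Any call whose target, or linker-generated stub, lies further away than this cannot be encoded in the branch instruction, which is when the linker reports "relocation truncated to fit".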
That helped, thanks! I could generate the two backtraces attached. node1_bt.txt node0_bt.txt
I've done more tests and realized that the hang has nothing to do with the number of nodes. It hangs even on a single node when M > 1 (where M is the first dimension of my arrays of shape (M, N, K, K, K)), but runs when M = 1. I have generated an updated backtrace for a single-node run. bt_single_node.txt
This backtrace doesn't look like a hang to me. It just looks like it is running really slowly. Can you provide a reproducer program and a command line for us to play with? I suspect you'll see the issue on other GPU machines that are not PowerPC.
Also, what is the behavior if you run only with CPUs and no GPUs?
I have been trying to make a smaller reproducer, but commenting out different parts of the code either makes it run normally or triggers the very slow execution, so I can't isolate the part of the code that is causing this issue.
After more tests, I've also noticed that it's probably not (completely) due to 5D arrays: if I decrease the volume of the arrays by decreasing K, I can run with M >= 2 on a single node. For example, with arrays of shape (2, N, 80, 80, 80) the code executes normally, but with arrays of shape (2, N, 96, 96, 96) I see the very slow execution. It's also not due to the total volume of the arrays, as I can run normally with arrays of shape (1, N, 256, 256, 256).
When using only CPUs, the code executes normally in all cases.
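To make the volume comparison concrete, the sketch below counts elements for the three shapes mentioned above. N is not given in the thread, so the counts are per unit of N; the helper name elements_per_unit_n is hypothetical.

```python
from math import prod


def elements_per_unit_n(m: int, k: int) -> int:
    """Element count of an (M, N, K, K, K) array divided by N."""
    return prod((m, k, k, k))


# (M, K) pairs for the three shapes reported in the thread.
cases = {
    "(2, N, 80, 80, 80)    runs normally": (2, 80),
    "(2, N, 96, 96, 96)    very slow": (2, 96),
    "(1, N, 256, 256, 256) runs normally": (1, 256),
}

for label, (m, k) in cases.items():
    print(f"{label}: {elements_per_unit_n(m, k):,} elements per unit N")
```

The slow case holds roughly nine times fewer elements than the (1, N, 256, 256, 256) case that runs normally, which is consistent with the observation that total volume alone does not explain the slowdown.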
Software versions
Python : 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 16:04:32) [GCC 12.3.0]
Platform : Linux-4.14.0-115.35.1.3chaos.ch6a.ppc64le-ppc64le-with-glibc2.17
Legion : legion-23.09.0-4871-g04ee5be1d
Legate : 23.11.00.dev+57.gde1ad0f
Cunumeric : 23.11.00.dev+33.g8693a3d6
Numpy : 1.26.3
Scipy : 1.12.0
Numba : 0.58.1
CTK package : (failed to detect)
GPU driver : 510.47.03
GPU devices :
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Expected behavior
I have an application operating on 5D arrays of shape (M, N, K, K, K), where N is fixed. The application works on 1 node (4 GPUs). I attempt two types of scaling:
Observed behavior
Point 2. above does not result in any error, but the code seems to indefinitely hang, even before starting any computation.
Example code or instructions
The code is executed on a PowerPC 9 system with 4 V100 GPUs per node. It is launched with:
Stack traceback or browser console output
None.