mfem / mfem

Lightweight, general, scalable C++ library for finite element methods
http://mfem.org
BSD 3-Clause "New" or "Revised" License

HYPRE+HIP Runtime Error #2910

Closed: wcdawn closed this issue 2 years ago

wcdawn commented 2 years ago

I'm trying to compile with HIP & HYPRE. Compiling with just HIP works fine, but I'd like to use HypreBoomerAMG. I get the following runtime error when running ex1p. Any help would be much appreciated.

Options used:
   --mesh ../data/star.mesh
   --order 1
   --no-static-condensation
   --no-partial-assembly
   --device cpu
   --visualization
Device configuration: cpu
Memory configuration: host-std
Number of finite element unknowns: 82561
Memory access fault by GPU node-1 (Agent handle: 0x24fa560) on address 0x7f3f67a0e000. Reason: Page not present or supervisor privilege.
[jezebel:1158855] *** Process received signal ***
[jezebel:1158855] Signal: Aborted (6)
[jezebel:1158855] Signal code:  (-6)
[jezebel:1158855] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7f4103a4b3c0]
[jezebel:1158855] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f40e7a7703b]
[jezebel:1158855] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f40e7a56859]
[jezebel:1158855] [ 3] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x72847)[0x7f40e75d1847]
[jezebel:1158855] [ 4] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x7545b)[0x7f40e75d445b]
[jezebel:1158855] [ 5] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x18757)[0x7f40e7577757]
[jezebel:1158855] [ 6] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f4103a3f609]
[jezebel:1158855] [ 7] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f40e7b53163]
[jezebel:1158855] *** End of error message ***
Aborted (core dumped)

I'm using the master branch of MFEM & HYPRE v2.24.0.

HYPRE config

./configure \
  --with-hip \
  --disable-fortran \
  CC=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicc \
  CXX=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicxx \
  --with-MPI-include='/home/wcdawn/bin/openmpi-4.1.2/include' \
  --with-MPI-libs='mpi mpi_cxx' \
  --with-MPI-lib-dirs='/home/wcdawn/bin/openmpi-4.1.2/lib'

MFEM config

make config \
  MFEM_USE_HIP=YES HIP_ARCH=gfx1030 \
  MFEM_USE_MPI=YES \
  MPICC=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicc \
  MPICXX=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicxx \
  HYPRE_DIR=@MFEM_DIR@/../hypre-2.24.0/src/hypre \
  METIS_DIR=@MFEM_DIR@/../metis-5.1.0 MFEM_USE_METIS_5=YES
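
For completeness, the build and run steps that follow these configs are roughly as below (a sketch of the standard MFEM workflow; the exact run command is shown further down in this thread):

make -j 4
cd examples
make ex1p
mpirun -np 1 ./ex1p
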
tzanio commented 2 years ago

@wcdawn, just to clarify -- you are using mfem:master as of today, i.e. mfem-4.4?

wcdawn commented 2 years ago

@tzanio correct. MFEM v4.4. I just did a pull today.

jandrej commented 2 years ago
Device configuration: cpu
Memory configuration: host-std

looks suspicious?

wcdawn commented 2 years ago

@jandrej I get the same message when passing -d hip. Here is the full command line that I ran.

$ /home/wcdawn/bin/openmpi-4.1.2/bin/mpirun -np 1 ./ex1p -d hip

Options used:
   --mesh ../data/star.mesh
   --order 1
   --no-static-condensation
   --no-partial-assembly
   --device hip
   --visualization
Device configuration: hip,cpu
Memory configuration: host-std,hip
Number of finite element unknowns: 82561
Memory access fault by GPU node-1 (Agent handle: 0x3c04670) on address 0x7f3c0fe04000. Reason: Page not present or supervisor privilege.
[jezebel:1158921] *** Process received signal ***
[jezebel:1158921] Signal: Aborted (6)
[jezebel:1158921] Signal code:  (-6)
[jezebel:1158921] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7f3d538003c0]
[jezebel:1158921] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f3d3782c03b]
[jezebel:1158921] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f3d3780b859]
[jezebel:1158921] [ 3] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x72847)[0x7f3d37386847]
[jezebel:1158921] [ 4] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x7545b)[0x7f3d3738945b]
[jezebel:1158921] [ 5] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x18757)[0x7f3d3732c757]
[jezebel:1158921] [ 6] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f3d537f4609]
[jezebel:1158921] [ 7] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f3d37908163]
[jezebel:1158921] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node jezebel exited on signal 6 (Aborted).
--------------------------------------------------------------------------
pazner commented 2 years ago

Hello @wcdawn,

Are you able to get a backtrace by running with gdb?

Also, we have tested with hypre version 2.23 -- do you encounter the same crashes with that version as well?
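
For a single MPI rank, running the example directly under gdb is usually the simplest way to do this (a sketch; adjust paths and flags to match your run):

$ mpirun -np 1 gdb --args ./ex1p -d hip
(gdb) run
  ... wait for the abort ...
(gdb) bt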

wcdawn commented 2 years ago

@pazner I rebuilt with HYPRE v2.23.0 and get the same error. I also rebuilt MFEM with MFEM_DEBUG=YES so I could get backtrace info. Here is the backtrace.

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fffdbf95859 in __GI_abort () at abort.c:79
#2  0x00007fffdbb10847 in rocr::core::Runtime::VMFaultHandler(long, void*) ()
   from /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1
#3  0x00007fffdbb1345b in rocr::core::Runtime::AsyncEventsLoop(void*) ()
   from /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1
#4  0x00007fffdbab6757 in rocr::os::ThreadTrampoline(void*) () from /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1
#5  0x00007ffff7f7e609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6  0x00007fffdc092163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
v-dobrev commented 2 years ago

It looks like this stack trace and crash happen on a separate thread, not on the main thread. Can you try switching to the main thread and getting the stack trace there? I suspect that the thread that crashes is created during some call to the HIP/ROCm runtime, but it is best to confirm that and find out exactly which call causes it.

wcdawn commented 2 years ago

@v-dobrev Here is the output of info threads from gdb

  Id   Target Id                                  Frame
  1    Thread 0x7fffdb7f7980 (LWP 1183544) "ex1p" 0x00007fffdc00d231 in tcache_get (tc_idx=<optimized out>) at malloc.c:2937
  2    Thread 0x7fffdb704700 (LWP 1183548) "ex1p" 0x00007fffdc0859cf in __GI___poll (fds=0x7fffd4000b60, nfds=1, timeout=3599969) at ../sysdeps/unix/sysv/linux/poll.c:29
  3    Thread 0x7fffdad42700 (LWP 1183549) "ex1p" 0x00007fffdc09249e in epoll_wait (epfd=10, events=0x2250d20, maxevents=32, timeout=119968) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
* 4    Thread 0x7fffd99df700 (LWP 1183550) "ex1p" __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
  6    Thread 0x7fffd8f7f700 (LWP 1183552) "ex1p" 0x00007fffdc0873db in ioctl () at ../sysdeps/unix/syscall-template.S:78

The backtrace from thread 1 looks interesting.

(gdb) thread 1
[Switching to thread 1 (Thread 0x7fffdb7f7980 (LWP 1183544))]
#0  0x00007fffdc00d231 in tcache_get (tc_idx=<optimized out>) at malloc.c:2937
2937    malloc.c: No such file or directory.
(gdb) bt
#0  0x00007fffdc00d231 in tcache_get (tc_idx=<optimized out>) at malloc.c:2937
#1  __GI___libc_malloc (bytes=23) at malloc.c:3051
#2  0x00007fffdc20fb39 in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fffc96c891f in ?? () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#4  0x00007fffc5205994 in amd_comgr_get_metadata_string () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#5  0x00007ffff71eac24 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#6  0x00007ffff71ecc43 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#7  0x00007fffc5205c51 in amd_comgr_iterate_map_metadata () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#8  0x00007ffff71ee5f8 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#9  0x00007ffff71c1e7b in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#10 0x00007ffff7180653 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#11 0x00007ffff718108d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#12 0x00007ffff71a9dec in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#13 0x00007ffff702265e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#14 0x00007ffff702418e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#15 0x00007ffff6fe5f7d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#16 0x00007ffff71073bf in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#17 0x00007ffff70ea857 in hipLaunchKernel () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#18 0x0000000001d51808 in std::enable_if<rocprim::detail::default_scan_config<0u, int>::use_lookback, hipError_t>::type rocprim::detail::scan_impl<true, rocprim::detail::default_scan_config<0u, int>, int*, int*, int, thrust::plus<void> >(void*, unsigned long&, int*, int*,
int, unsigned long, thrust::plus<void>, ihipStream_t*, bool) ()
#19 0x0000000001d50b08 in thrust::hip_rocprim::exclusive_scan_n<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base>, int*, long, int*, int, thrust::plus<void> >(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>)::workaround::par(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>) ()
#20 0x0000000001dc3311 in hypreDevice_IntegerExclusiveScan(int, int*) ()
#21 0x0000000001d73db2 in hypre_BoomerAMGCreateSDevice ()
#22 0x0000000001c4c027 in hypre_BoomerAMGCreateS ()
#23 0x0000000001bdd85f in hypre_BoomerAMGSetup ()
#24 0x0000000001428131 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3638
#25 0x0000000001428f53 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3721
#26 0x000000000143b5ad in mfem::CGSolver::Mult (this=0x7fffffffa238, b=..., x=...) at linalg/solvers.cpp:730
#27 0x0000000001397596 in main (argc=<optimized out>, argv=<optimized out>) at ex1p.cpp:255

I'm starting to suspect that it could be a problem with the HIP/ROCm runtime as well.

v-dobrev commented 2 years ago

I just noticed that you did not set the GPU arch in your hypre config command -- try adding --with-gpu-arch=gfx... -- I'm not sure what happens when it is not set.
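
For your card, that would look something like this (a sketch of your earlier configure line with the architecture added; gfx1030 taken from your MFEM HIP_ARCH setting):

./configure \
  --with-hip \
  --with-gpu-arch=gfx1030 \
  --disable-fortran \
  CC=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicc \
  CXX=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicxx \
  --with-MPI-include='/home/wcdawn/bin/openmpi-4.1.2/include' \
  --with-MPI-libs='mpi mpi_cxx' \
  --with-MPI-lib-dirs='/home/wcdawn/bin/openmpi-4.1.2/lib'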

wcdawn commented 2 years ago

Thanks for catching that. It doesn't seem to have changed anything and the backtrace looks the same.

#0  0x00007fffc99b3b91 in ?? () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#1  0x00007fffc96c87ac in ?? () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#2  0x00007fffc5205994 in amd_comgr_get_metadata_string () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#3  0x00007ffff71eac24 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#4  0x00007ffff71eb738 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#5  0x00007fffc5205c51 in amd_comgr_iterate_map_metadata () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#6  0x00007ffff71ee131 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#7  0x00007ffff71ee595 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#8  0x00007ffff71c1e7b in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#9  0x00007ffff7180653 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#10 0x00007ffff718108d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#11 0x00007ffff71a9dec in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#12 0x00007ffff702265e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#13 0x00007ffff702418e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#14 0x00007ffff6fe5f7d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#15 0x00007ffff71073bf in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#16 0x00007ffff70ea857 in hipLaunchKernel () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#17 0x0000000001d51808 in std::enable_if<rocprim::detail::default_scan_config<0u, int>::use_lookback, hipError_t>::type rocprim::detail::scan_impl<true, rocprim::detail::default_scan_config<0u, int>, int*, int*, int, thrust::plus<void> >(void*, unsigned long&, int*, int*, int, unsigned long, thrust::plus<void>, ihipStream_t*, bool) ()
#18 0x0000000001d50b08 in thrust::hip_rocprim::exclusive_scan_n<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base>, int*, long, int*, int, thrust::plus<void> >(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>)::workaround::par(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>) ()
#19 0x0000000001dc3311 in hypreDevice_IntegerExclusiveScan(int, int*) ()
#20 0x0000000001d73db2 in hypre_BoomerAMGCreateSDevice ()
#21 0x0000000001c4c027 in hypre_BoomerAMGCreateS ()
#22 0x0000000001bdd85f in hypre_BoomerAMGSetup ()
#23 0x0000000001428131 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3638
#24 0x0000000001428f53 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3721
#25 0x000000000143b5ad in mfem::CGSolver::Mult (this=0x7fffffffa248, b=..., x=...) at linalg/solvers.cpp:730
#26 0x0000000001397596 in main (argc=<optimized out>, argv=<optimized out>) at ex1p.cpp:255
v-dobrev commented 2 years ago

Another suggestion/question: were you able to run older MFEM versions on this machine, e.g. right after https://github.com/mfem/mfem/pull/2750 was merged?

Also, just to confirm, if you build MFEM with HIP and HYPRE without HIP, does this work?
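
Concretely, that second configuration is just your commands above with --with-hip dropped from the hypre configure (a sketch):

./configure \
  --disable-fortran \
  CC=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicc \
  CXX=/home/wcdawn/bin/openmpi-4.1.2/bin/mpicxx \
  --with-MPI-include='/home/wcdawn/bin/openmpi-4.1.2/include' \
  --with-MPI-libs='mpi mpi_cxx' \
  --with-MPI-lib-dirs='/home/wcdawn/bin/openmpi-4.1.2/lib'

with the MFEM make config line (MFEM_USE_HIP=YES, HIP_ARCH=gfx1030, etc.) left unchanged.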

wcdawn commented 2 years ago

I did git checkout 4e6b6f7d472b65d5fd25855a309e1666df7209d1, which was the commit that merged #2750, and got the same error.

Building MFEM with HIP and HYPRE without HIP does work. Additionally, building both without HIP works.

Is there a HYPRE example to use to test this? It seems like it could be something in HYPRE itself or maybe something in the MFEM/HYPRE interface.

v-dobrev commented 2 years ago

Hi @wcdawn,

were you able to figure out what the problem is?

@liruipeng, we suspect that the above issue (see the backtrace here: https://github.com/mfem/mfem/issues/2910#issuecomment-1077510700) may be in hypre. What would be a good way for @wcdawn to test this in hypre itself without mfem?

wcdawn commented 2 years ago

@v-dobrev Unfortunately not. I think it could be something with HYPRE. I'm not sure if it has been tested with this particular GPU.

v-dobrev commented 2 years ago

cc: @noelchalmers

noelchalmers commented 2 years ago

and @pbauman

pbauman commented 2 years ago

Hi folks. There is certainly an issue with HYPRE at present that I will try to address when I can. The issue is that the Navi gaming cards (gfx1030 indicates an RDNA2 card, so something like a 6900XT) run with a warp/wavefront size of 32, while HYPRE on AMD GPUs is currently set up for a warp/wavefront size of 64. I'll post a note here when we update HYPRE to support wavefront size 32 on AMD GPUs.
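
To illustrate the mismatch, here is a minimal sketch (not HYPRE code): the hardware reports its wavefront size at run time, and on gfx1030 that value is 32, so any shared-memory sizing or lane indexing derived from a hard-coded 64 no longer matches the device -- which is enough to produce the kind of GPU page fault reported above.

#include <hip/hip_runtime.h>
#include <cstdio>

int main()
{
   // Query the actual wavefront size of device 0.
   hipDeviceProp_t prop;
   if (hipGetDeviceProperties(&prop, 0) != hipSuccess) { return 1; }

   const int assumed = 64; // wavefront size the device kernels were written for
   std::printf("device wavefront size = %d, kernel assumption = %d\n",
               prop.warpSize, assumed);

   // On a gfx1030 (RDNA2) card warpSize is 32, so warp-synchronous kernels
   // that size shared memory or compute lane/wave indices from the assumed
   // 64 can read or write out of bounds.
   if (prop.warpSize != assumed)
   {
      std::printf("mismatch: kernels assuming %d-lane wavefronts may access "
                  "memory out of bounds on this device\n", assumed);
   }
   return 0;
}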

v-dobrev commented 2 years ago

@pbauman, thank you for looking into this issue.

stale[bot] commented 2 years ago

:warning: This issue or PR has been automatically marked as stale because it has not had any activity in the last month. If no activity occurs in the next week, it will be automatically closed. Thank you for your contributions.

wcdawn commented 2 years ago

@pbauman @v-dobrev any update here or a potential timeline?

stale[bot] commented 2 years ago

:warning: This issue or PR has been automatically marked as stale because it has not had any activity in the last month. If no activity occurs in the next week, it will be automatically closed. Thank you for your contributions.