Closed: @wcdawn closed this issue 2 years ago.
@wcdawn, just to clarify -- you are using mfem:master as of today, i.e. mfem-4.4?
@tzanio correct. MFEM v4.4. I just did a pull today.
Device configuration: cpu
Memory configuration: host-std
looks suspicious?
@jandrej I get the same message when passing -d hip. Here is the full command line that I ran.
$ /home/wcdawn/bin/openmpi-4.1.2/bin/mpirun -np 1 ./ex1p -d hip
Options used:
--mesh ../data/star.mesh
--order 1
--no-static-condensation
--no-partial-assembly
--device hip
--visualization
Device configuration: hip,cpu
Memory configuration: host-std,hip
Number of finite element unknowns: 82561
Memory access fault by GPU node-1 (Agent handle: 0x3c04670) on address 0x7f3c0fe04000. Reason: Page not present or supervisor privilege.
[jezebel:1158921] *** Process received signal ***
[jezebel:1158921] Signal: Aborted (6)
[jezebel:1158921] Signal code: (-6)
[jezebel:1158921] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x143c0)[0x7f3d538003c0]
[jezebel:1158921] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f3d3782c03b]
[jezebel:1158921] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f3d3780b859]
[jezebel:1158921] [ 3] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x72847)[0x7f3d37386847]
[jezebel:1158921] [ 4] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x7545b)[0x7f3d3738945b]
[jezebel:1158921] [ 5] /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1(+0x18757)[0x7f3d3732c757]
[jezebel:1158921] [ 6] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f3d537f4609]
[jezebel:1158921] [ 7] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f3d37908163]
[jezebel:1158921] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node jezebel exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Hello @wcdawn,
Are you able to get a backtrace by running with gdb?
Also, we have tested with hypre version 2.23 -- do you encounter the same crashes with that version as well?
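For reference, one common way to capture such a backtrace from a single-rank MPI run is gdb's batch mode (a sketch only; the binary name and options shown here mirror the command above, and paths may differ on your system):

```
# Run the single rank under gdb; 'run' starts the program and 'bt'
# prints a backtrace automatically once the process aborts.
mpirun -np 1 gdb --batch -ex run -ex bt --args ./ex1p -d hip
```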
@pazner I rebuilt with HYPRE v2.23.0 and get the same error. I also rebuilt MFEM with MFEM_DEBUG=YES
so I could get backtrace info. Here is the backtrace.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fffdbf95859 in __GI_abort () at abort.c:79
#2 0x00007fffdbb10847 in rocr::core::Runtime::VMFaultHandler(long, void*) ()
from /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1
#3 0x00007fffdbb1345b in rocr::core::Runtime::AsyncEventsLoop(void*) ()
from /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1
#4 0x00007fffdbab6757 in rocr::os::ThreadTrampoline(void*) () from /opt/rocm-5.0.0/hip/lib/../../lib/libhsa-runtime64.so.1
#5 0x00007ffff7f7e609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#6 0x00007fffdc092163 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
It looks like this stack trace and crash happen on a separate thread, not on the main thread. Can you try switching to the main thread and getting the stack trace there? I suspect that this new thread that crashes is created during some call to the HIP/ROCm runtime but it is best to confirm that and find out what call exactly causes this.
@v-dobrev Here is the output of info threads from gdb:
Id Target Id Frame
1 Thread 0x7fffdb7f7980 (LWP 1183544) "ex1p" 0x00007fffdc00d231 in tcache_get (tc_idx=<optimized out>) at malloc.c:2937
2 Thread 0x7fffdb704700 (LWP 1183548) "ex1p" 0x00007fffdc0859cf in __GI___poll (fds=0x7fffd4000b60, nfds=1, timeout=3599969) at ../sysdeps/unix/sysv/linux/poll.c:29
3 Thread 0x7fffdad42700 (LWP 1183549) "ex1p" 0x00007fffdc09249e in epoll_wait (epfd=10, events=0x2250d20, maxevents=32, timeout=119968) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
* 4 Thread 0x7fffd99df700 (LWP 1183550) "ex1p" __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
6 Thread 0x7fffd8f7f700 (LWP 1183552) "ex1p" 0x00007fffdc0873db in ioctl () at ../sysdeps/unix/syscall-template.S:78
The backtrace from thread 1 looks interesting.
(gdb) thread 1
[Switching to thread 1 (Thread 0x7fffdb7f7980 (LWP 1183544))]
#0 0x00007fffdc00d231 in tcache_get (tc_idx=<optimized out>) at malloc.c:2937
2937 malloc.c: No such file or directory.
(gdb) bt
#0 0x00007fffdc00d231 in tcache_get (tc_idx=<optimized out>) at malloc.c:2937
#1 __GI___libc_malloc (bytes=23) at malloc.c:3051
#2 0x00007fffdc20fb39 in operator new(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007fffc96c891f in ?? () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#4 0x00007fffc5205994 in amd_comgr_get_metadata_string () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#5 0x00007ffff71eac24 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#6 0x00007ffff71ecc43 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#7 0x00007fffc5205c51 in amd_comgr_iterate_map_metadata () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#8 0x00007ffff71ee5f8 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#9 0x00007ffff71c1e7b in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#10 0x00007ffff7180653 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#11 0x00007ffff718108d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#12 0x00007ffff71a9dec in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#13 0x00007ffff702265e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#14 0x00007ffff702418e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#15 0x00007ffff6fe5f7d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#16 0x00007ffff71073bf in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#17 0x00007ffff70ea857 in hipLaunchKernel () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#18 0x0000000001d51808 in std::enable_if<rocprim::detail::default_scan_config<0u, int>::use_lookback, hipError_t>::type rocprim::detail::scan_impl<true, rocprim::detail::default_scan_config<0u, int>, int*, int*, int, thrust::plus<void> >(void*, unsigned long&, int*, int*,
int, unsigned long, thrust::plus<void>, ihipStream_t*, bool) ()
#19 0x0000000001d50b08 in thrust::hip_rocprim::exclusive_scan_n<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base>, int*, long, int*, int, thrust::plus<void> >(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>)::workaround::par(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>) ()
#20 0x0000000001dc3311 in hypreDevice_IntegerExclusiveScan(int, int*) ()
#21 0x0000000001d73db2 in hypre_BoomerAMGCreateSDevice ()
#22 0x0000000001c4c027 in hypre_BoomerAMGCreateS ()
#23 0x0000000001bdd85f in hypre_BoomerAMGSetup ()
#24 0x0000000001428131 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3638
#25 0x0000000001428f53 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3721
#26 0x000000000143b5ad in mfem::CGSolver::Mult (this=0x7fffffffa238, b=..., x=...) at linalg/solvers.cpp:730
#27 0x0000000001397596 in main (argc=<optimized out>, argv=<optimized out>) at ex1p.cpp:255
I'm starting to suspect that it could be a problem with the HIP/ROCm runtime as well.
I just noticed that you did not set the GPU arch in your hypre config command -- try adding --with-gpu-arch=gfx... -- I'm not sure what happens when it is not set.
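As a sketch, the hypre configure line might look like the following (the gfx90a value is purely illustrative -- substitute the arch reported for your own card, e.g. by rocminfo):

```
# Illustrative hypre configuration for HIP; --with-gpu-arch pins the
# AMD GPU target so device kernels are built for the right ISA.
./configure --with-hip --with-gpu-arch=gfx90a
```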
Thanks for catching that. It doesn't seem to have changed anything and the backtrace looks the same.
#0 0x00007fffc99b3b91 in ?? () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#1 0x00007fffc96c87ac in ?? () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#2 0x00007fffc5205994 in amd_comgr_get_metadata_string () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#3 0x00007ffff71eac24 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#4 0x00007ffff71eb738 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#5 0x00007fffc5205c51 in amd_comgr_iterate_map_metadata () from /opt/rocm-5.0.0/hip/lib/../../lib/libamd_comgr.so.2
#6 0x00007ffff71ee131 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#7 0x00007ffff71ee595 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#8 0x00007ffff71c1e7b in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#9 0x00007ffff7180653 in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#10 0x00007ffff718108d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#11 0x00007ffff71a9dec in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#12 0x00007ffff702265e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#13 0x00007ffff702418e in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#14 0x00007ffff6fe5f7d in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#15 0x00007ffff71073bf in ?? () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#16 0x00007ffff70ea857 in hipLaunchKernel () from /opt/rocm-5.0.0/hip/lib/libamdhip64.so.5
#17 0x0000000001d51808 in std::enable_if<rocprim::detail::default_scan_config<0u, int>::use_lookback, hipError_t>::type rocprim::detail::scan_impl<true, rocprim::detail::default_scan_config<0u, int>, int*, int*, int, thrust::plus<void> >(void*, unsigned long&, int*, int*, int, unsigned long, thrust::plus<void>, ihipStream_t*, bool) ()
#18 0x0000000001d50b08 in thrust::hip_rocprim::exclusive_scan_n<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base>, int*, long, int*, int, thrust::plus<void> >(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>)::workaround::par(thrust::hip_rocprim::execution_policy<thrust::detail::execute_with_allocator<hypre_device_allocator&, thrust::hip_rocprim::execute_on_stream_base> >&, int*, long, int*, int, thrust::plus<void>) ()
#19 0x0000000001dc3311 in hypreDevice_IntegerExclusiveScan(int, int*) ()
#20 0x0000000001d73db2 in hypre_BoomerAMGCreateSDevice ()
#21 0x0000000001c4c027 in hypre_BoomerAMGCreateS ()
#22 0x0000000001bdd85f in hypre_BoomerAMGSetup ()
#23 0x0000000001428131 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3638
#24 0x0000000001428f53 in mfem::HypreSolver::Mult (this=0x4b96d10, b=..., x=...) at linalg/hypre.cpp:3721
#25 0x000000000143b5ad in mfem::CGSolver::Mult (this=0x7fffffffa248, b=..., x=...) at linalg/solvers.cpp:730
#26 0x0000000001397596 in main (argc=<optimized out>, argv=<optimized out>) at ex1p.cpp:255
Another suggestion/question: were you able to run older MFEM versions on this machine, e.g. right after https://github.com/mfem/mfem/pull/2750 was merged?
Also, just to confirm, if you build MFEM with HIP and HYPRE without HIP, does this work?
I did git checkout 4e6b6f7d472b65d5fd25855a309e1666df7209d1, which was the commit that merged #2750, and got the same error.
Building MFEM with HIP and HYPRE without HIP does work. Additionally, building both without HIP works.
Is there a HYPRE example to use to test this? It seems like it could be something in HYPRE itself, or maybe something in the MFEM/HYPRE interface.
Hi @wcdawn,
were you able to figure out what the problem is?
@liruipeng, we suspect that the above issue (see the backtrace here: https://github.com/mfem/mfem/issues/2910#issuecomment-1077510700) may be in hypre. What would be a good way for @wcdawn to test this in hypre itself, without mfem?
@v-dobrev Unfortunately not. I think it could be something with HYPRE. I'm not sure if it has been tested with this particular GPU.
cc: @noelchalmers and @pbauman
Hi folks. There is certainly an issue with HYPRE at present that I will try to address when I can. The issue is that the Navi gaming cards (gfx1030 indicates an RDNA2 card, so something like a 6900 XT) run with a warp/wavefront size of 32, while HYPRE on AMD GPUs is currently set up for a warp/wavefront size of 64. I'll post a note here when we update HYPRE to support wavefront size 32 on AMD GPUs.
@pbauman, thank you for looking into this issue.
:warning: This issue or PR has been automatically marked as stale because it has not had any activity in the last month. If no activity occurs in the next week, it will be automatically closed. Thank you for your contributions.
@pbauman @v-dobrev any update here or a potential timeline?
I'm trying to compile with HIP & HYPRE. Compiling with just HIP works fine, but I'd like to use HypreBoomerAMG. I get the following runtime error when running ex1p. Any help would be much appreciated. I'm using the master branch of MFEM & HYPRE v2.24.0.
HYPRE config
MFEM config