pmodels / oshmpi

OpenSHMEM Implementation on MPI
https://pmodels.github.io/oshmpi-www/

Problems running oshmpi with Fujitsu MPI on Fugaku #105

Open tonycurtis opened 3 years ago

tonycurtis commented 3 years ago

I can build, but this is what I see on a compute node. Any idea?

(gdb) cont
Continuing.
[New Thread 0x4000025ff010 (LWP 14149)]
[New Thread 0x4000029ff010 (LWP 14150)]

Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
ompi_mfh_base_real_t_cvar_write () at pcvar_write.c:43
43  pcvar_write.c: No such file or directory.
(gdb) bt
#0  ompi_mfh_base_real_t_cvar_write () at pcvar_write.c:43
#1  ompi_mfh_ptl_t_cvar_write ()
    at ../../../../src/ompi/mca/mfh/ptl/mfh_ptl_call.h:692
#2  PMPI_T_cvar_write ()
    at ../../../../src/ompi/mca/mfh/base/mfh_base_func_defs.h:13523
#3  0x00004000000a62a8 in set_mpit_cvar (cvar_name=<optimized out>,
    val=<optimized out>) at ../oshmpi-git/src/internal/setup_impl.c:698
#4  0x00004000000a6354 in initialize_mpit ()
    at ../oshmpi-git/src/internal/setup_impl.c:708
#5  0x00004000000a65d4 in OSHMPI_initialize_thread (required=<optimized out>,
    provided=<optimized out>) at ../oshmpi-git/src/internal/setup_impl.c:780
#6  0x00004000000b24a0 in shmem_init () at ../oshmpi-git/src/shmem/setup.c:13
#7  0x0000000000400ee4 in main () at hello.c:64
(gdb) q
A debugging session is active.

    Inferior 1 [process 14143] will be detached.

Quit anyway? (y or n) y
Detaching from program: /vol0004/ra010008/XXXXXX/shmem/openshmem-examples/c/a.out, process 14143
[Inferior 1 (process 14143) detached]
[c34-0003c:14143] *** Process received signal ***
[c34-0003c:14143] Signal: Segmentation fault (11)
[c34-0003c:14143] Signal code: Address not mapped (1)
[c34-0003c:14143] Failing at address: 0x1
[c34-0003c:14143] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x40000006066c]
[c34-0003c:14143] [ 1] /opt/FJSVxtclanga/tcsds-1.2.30a/lib64/libmpi.so.0(PMPI_T_cvar_write+0x54)[0x40000023d574]
[c34-0003c:14143] [ 2] /home/ra010008/XXXXXX/opt/oshmpi/git/lib/liboshmpi.so.0(+0x162a8)[0x4000000a62a8]
[c34-0003c:14143] [ 3] /home/ra010008/XXXXXX/opt/oshmpi/git/lib/liboshmpi.so.0(+0x16354)[0x4000000a6354]
[c34-0003c:14143] [ 4] /home/ra010008/XXXXXX/opt/oshmpi/git/lib/liboshmpi.so.0(OSHMPI_initialize_thread+0x270)[0x4000000a65d4]
[c34-0003c:14143] [ 5] /home/ra010008/XXXXXX/opt/oshmpi/git/lib/liboshmpi.so.0(shmem_init+0x24)[0x4000000b24a0]
[c34-0003c:14143] [ 6] ./a.out[0x400ee4]
[c34-0003c:14143] [ 7] /lib64/libc.so.6(__libc_start_main+0xe4)[0x400001030be4]
[c34-0003c:14143] [ 8] ./a.out[0x400dfc]
[c34-0003c:14143] *** End of error message ***

minsii commented 3 years ago

@tonycurtis Sorry, I somehow did not get a notification for this issue. Can you try #109 when you get a chance? I am not sure whether it resolves the issue, but it fixes an obvious bug in OSHMPI.

tonycurtis commented 3 years ago

Same problem, unfortunately.

minsii commented 3 years ago

@tonycurtis Can you please try #112? Set the environment variable OSHMPI_ENABLE_MPI_T=0 when you run, e.g.:

OSHMPI_VERBOSE=1 OSHMPI_ENABLE_MPI_T=0 mpiexec -np 2 ./hello

It disables the MPI_T code.
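
For readers following along, here is a minimal sketch of how an environment variable such as OSHMPI_ENABLE_MPI_T could gate an initialization step. It is not OSHMPI's actual code: the helper names env_flag_enabled and mpit_setup_would_run are hypothetical, and both the accepted spellings of "off" and the enabled-by-default behavior are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read an environment variable as an on/off flag.
 * Unset or empty falls back to default_value; "0", "false", "off" and
 * "no" disable the feature; any other value enables it. */
int env_flag_enabled(const char *name, int default_value)
{
    const char *val = getenv(name);

    if (val == NULL || val[0] == '\0')
        return default_value;
    if (strcmp(val, "0") == 0 || strcmp(val, "false") == 0 ||
        strcmp(val, "off") == 0 || strcmp(val, "no") == 0)
        return 0;
    return 1;
}

/* Hypothetical stand-in for the decision made during initialization.
 * Returns 1 if the MPI_T setup step would run, 0 if it is skipped.
 * Assumption: the feature is enabled unless explicitly turned off. */
int mpit_setup_would_run(void)
{
    if (!env_flag_enabled("OSHMPI_ENABLE_MPI_T", 1))
        return 0;   /* skip every MPI_T call, including cvar writes */
    return 1;       /* a real library would do its MPI_T setup here */
}
```

With a gate of this kind, launching with OSHMPI_ENABLE_MPI_T=0 would keep the process from ever reaching the PMPI_T_cvar_write frame shown in the backtrace, provided the check happens before any MPI_T call is made.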

tonycurtis commented 3 years ago

Same problem

minsii commented 3 years ago

It should no longer run the set_mpit_cvar function. I just added a debug message to the above PR.

Would you mind updating the code and running again? Please run with OSHMPI_VERBOSE=1 OSHMPI_ENABLE_MPI_T=0 SHMEM_DEBUG=1 and copy the output here.