open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

failure during MPI_Win_detach #7384

Closed naughtont3 closed 3 years ago

naughtont3 commented 4 years ago

Background information

Application failure with Open MPI with one sided communication (OSC).

Reporting on behalf of user to help track problem.

The test works fine with MPICH/3.x, Spectrum MPI, and Intel MPI.

What version of Open MPI are you using?

Describe how Open MPI was installed

A standard tarball build can reproduce the issue, using the gcc/8.x compiler suite (gcc-8, g++-8, gfortran-8). gcc/8.x is needed to avoid gfortran bugs present in other releases.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

Reproducible on any Linux workstation (gcc/8.x is needed to avoid gfortran bugs present in other releases).


Details of the problem

You should be able to reproduce this failure on any Linux workstation. The only thing you need to make sure of is to use the gcc/8.x compiler suite (gcc-8, g++-8, gfortran-8), since other versions are buggy in their gfortran part.

I added a .txt extension to the files so they could be attached to the GitHub ticket.

Normally run.exatensor.sh invokes mpiexec with the binary directly, but for some reason the mpiexec from the latest Git master branch fails to load some dynamic libraries (libgfortran), so I introduced a workaround in which run.exatensor.sh invokes mpiexec with exec.sh, which in turn executes the binary Qforce.x. Previous OpenMPI versions did not have this issue, by the way. But all of them fail in MPI_Win_detach, as you can see below:

Destroying tensor dtens ...
[exadesktop:32108] *** An error occurred in MPI_Win_detach
[exadesktop:32108] *** reported by process [3156279297,1]
[exadesktop:32108] *** on win rdma window 5
[exadesktop:32108] *** MPI_ERR_OTHER: known error not in list
[exadesktop:32108] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[exadesktop:32108] ***    and potentially your MPI job)
[exadesktop:32108] [0] func:/usr/local/mpi/openmpi/git/lib/libopen-pal.so.0(opal_backtrace_buffer+0x35) [0x149adfa0726f]
[exadesktop:32108] [1] func:/usr/local/mpi/openmpi/git/lib/libmpi.so.0(ompi_mpi_abort+0x9a) [0x149ae0574db1]
[exadesktop:32108] [2] func:/usr/local/mpi/openmpi/git/lib/libmpi.so.0(+0x48d6e) [0x149ae055ad6e]
[exadesktop:32108] [3] func:/usr/local/mpi/openmpi/git/lib/libmpi.so.0(ompi_mpi_errors_are_fatal_win_handler+0xed) [0x149ae055a3d2]
[exadesktop:32108] [4] func:/usr/local/mpi/openmpi/git/lib/libmpi.so.0(ompi_errhandler_invoke+0x155) [0x149ae0559c11]
[exadesktop:32108] [5] func:/usr/local/mpi/openmpi/git/lib/libmpi.so.0(PMPI_Win_detach+0x197) [0x149ae05f2417]
[exadesktop:32108] [6] func:/usr/local/mpi/openmpi/git/lib/libmpi_mpifh.so.0(mpi_win_detach__+0x38) [0x149ae0946d86]
[exadesktop:32108] [7] func:./Qforce.x() [0x564a82]
[exadesktop:32108] [8] func:./Qforce.x() [0x564b42]
[exadesktop:32108] [9] func:./Qforce.x() [0x56df9e]
[exadesktop:32108] [10] func:./Qforce.x() [0x4319fa]
[exadesktop:32108] [11] func:./Qforce.x() [0x42a326]
[exadesktop:32108] [12] func:./Qforce.x() [0x42e2cc]
[exadesktop:32108] [13] func:./Qforce.x() [0x4de039]
[exadesktop:32108] [14] func:/usr/local/gcc/8.2.0/lib64/libgomp.so.1(+0x1743e) [0x149ae841343e]
[exadesktop:32108] [15] func:/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x149ae1d416db]
[exadesktop:32108] [16] func:/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x149adfdea88f]
[exadesktop:32103] PMIX ERROR: UNREACHABLE in file ../../../../../../../opal/mca/pmix/pmix4x/openpmix/src/server/pmix_server.c at line 2188
[exadesktop:32103] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[exadesktop:32103] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
DmitryLyakh commented 4 years ago

All OpenMPI versions I tried, including 3.x, 4.x, and the latest GitHub master, fail, although I am not sure the problem is exactly the same in all cases. I have only debugged the latest master and 4.0.2, where I found that the error occurs in ompi/mca/osc/rdma/osc_rdma_dynamic.c in the function ompi_osc_rdma_detach because ompi_osc_rdma_find_region_containing() does not find the dynamic memory region, even though it should still exist (I checked the application code as best I could).

jsquyres commented 4 years ago

Can you make a smaller reproducer, perchance?

hjelmn commented 4 years ago

I will use the one provided. Will quickly delete gfortran once I am done :p. Keep in mind that in some implementations attach/detach are essentially no-ops so success with another implementation does not necessarily mean there is a bug in Open MPI.

But given that attach/detach test coverage is incomplete it would not be surprising if there is a bug.

DmitryLyakh commented 4 years ago

Making a smaller reproducer would be extremely hard due to the code architecture, but I can assist with code navigation and debugging. In particular, there is only one source of MPI_Win_attach() in DDSS/distributed.F90 (subroutine DataWinAttach) and there is only one source of MPI_Win_detach() in DDSS/distributed.F90 (subroutine DataWinDetach). Similarly, all other MPI-3 RMA one-sided functionality is located in DDSS/distributed.F90. MPI dynamic windows are used with MPI_Rget() and MPI_Raccumulate() performing communications from within a PARALLEL OpenMP region, synchronized via MPI_Test() (inside the MPI_Win_lock and MPI_Win_unlock epoch).
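For reference, here is a minimal C sketch of the access pattern described above (names, counts, and buffer sizes are illustrative and not taken from DDSS/distributed.F90; the real code is Fortran and drives the epoch from inside an OpenMP parallel region):

    /* Minimal sketch: dynamic window, attach/detach, and request-based RMA
     * progressed with MPI_Test inside a passive-target lock/unlock epoch.
     * Self-targeted for brevity. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int provided, rank, flag = 0, i;
        double local[1024], copy[1024];
        MPI_Aint disp;
        MPI_Win win;
        MPI_Request req;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < 1024; ++i) local[i] = (double) i;

        /* Window with no initial memory; regions are attached dynamically. */
        MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Expose a local buffer (the DataWinAttach step). */
        MPI_Win_attach(win, local, sizeof(local));
        MPI_Get_address(local, &disp);  /* dynamic windows use absolute addresses */

        /* Passive-target epoch with request-based RMA, progressed by MPI_Test. */
        MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
        MPI_Rget(copy, 1024, MPI_DOUBLE, rank, disp, 1024, MPI_DOUBLE, win, &req);
        while (!flag)
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        MPI_Win_unlock(rank, win);

        /* Remove the region again (the DataWinDetach step); this is the call
         * that aborts in the reported failure. */
        MPI_Win_detach(win, local);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }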

DmitryLyakh commented 4 years ago

I am also trying to double-check that my use of MPI-3 is valid, which I believe it is, but there is still a chance I have missed something.

hjelmn commented 4 years ago
f951: internal compiler error: in generate_finalization_wrapper, at fortran/class.c:1993
Please submit a full bug report,
with preprocessed source if appropriate.
DmitryLyakh commented 4 years ago

You are likely not using gcc/8.x.

DmitryLyakh commented 4 years ago

Only gfortran/8.x works; other versions have compiler bugs, including gfortran/9.x.

DmitryLyakh commented 4 years ago

I have just checked again and can confirm that the application is trying to detach a valid (previously attached) region which has not been detached before. Moreover, this is neither the first detach call nor the last, if that helps. Also, no error code is returned from MPI_Win_detach() because of the crash; I would assume that if this issue were on the application side, Open MPI would have returned an error code instead of crashing, right?
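(For what it is worth, the abort happens because the window is using the default error handler MPI_ERRORS_ARE_FATAL. A minimal sketch of how an error code could be obtained instead; the helper name is hypothetical:)

    /* Sketch: install MPI_ERRORS_RETURN on the window so a failing
     * MPI_Win_detach hands back an error code instead of aborting. */
    #include <mpi.h>
    #include <stdio.h>

    void detach_checked(MPI_Win win, void *base)   /* hypothetical helper */
    {
        char msg[MPI_MAX_ERROR_STRING];
        int err, len;

        MPI_Win_set_errhandler(win, MPI_ERRORS_RETURN);
        err = MPI_Win_detach(win, base);
        if (err != MPI_SUCCESS) {
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "MPI_Win_detach failed: %s\n", msg);
        }
    }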

hjelmn commented 4 years ago

The problem occurs because the program attaches multiple regions that overlap by at least one page (the minimum hardware registration unit). Taking a look at one crash:

Both of these regions contain the 4k page at 0x151090000000. Ideally, osc/rdma should have returned an error, as the implementation treats page overlap as region overlap. Overlapping regions are not allowed by the standard. The standard does not, however, give guidance on what overlapping means, so it is fair to assume that page-level overlap is allowed. This may be an error in the standard, as the implementation is usually free to set restrictions based on hardware characteristics. I would very strongly recommend against 1) page-overlapped regions, and 2) small regions (an 8-byte attach is incredibly wasteful). I will see about either allowing this kind of overlap or at least returning the proper error from MPI_Win_attach.

Both of these are trivial to implement; I just want to make sure we follow the intended behavior (the standard may be wrong here).
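For illustration, a minimal sketch (not the application code) of two attachments that are disjoint byte-wise yet share a 4 KiB page, the situation described above; the helper name and buffer are hypothetical:

    #include <mpi.h>

    void attach_two_small_regions(MPI_Win dyn_win)
    {
        static double buf[1024];          /* contiguous storage spanning a few pages */

        /* Two 8-byte regions, 16 bytes apart: disjoint address ranges, but
         * the page containing &buf[0] also contains &buf[2]. */
        MPI_Win_attach(dyn_win, &buf[0], sizeof(double));
        MPI_Win_attach(dyn_win, &buf[2], sizeof(double));

        /* ... RMA traffic ... */

        MPI_Win_detach(dyn_win, &buf[2]);
        MPI_Win_detach(dyn_win, &buf[0]); /* detach is where the region lookup failed */
    }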

DmitryLyakh commented 4 years ago

Ha, this is subtle, but it makes total sense from the implementation point of view. Thanks for such a quick investigation! I have always interpreted the standard as requiring non-overlapping virtual address ranges. On the other hand, can we safely assume that a 4 KiB page is always the minimum hardware registration unit on all systems? Otherwise the MPI standard would be introducing a non-portable restriction. In any case, an error code and message from MPI_Win_attach would definitely help here. Thanks.

DmitryLyakh commented 4 years ago

I have just built the branch osc_rdma_allow_overlapping_registration_regions_and_return_the_correct_error_code_when_regions_overlap from https://github.com/hjelmn/ompi, but it results in exactly the same problem as before, even if I specify --mca osc_rdma_max_attach 128 in mpirun. Am I testing the wrong branch/commit? The commit I have is ec331c7998f84924713b7d1a97422ed7561cee7b (Author: Nathan Hjelm <hjelmn@google.com>, Date: Tue Feb 11 21:57:24 2020 -0800).

naughtont3 commented 4 years ago

@hjelmn @DmitryLyakh I did a test with #7383 and #7387 cherry-picked onto a point just prior to the current PRRTE (no-ORTE) changes, because of unrelated problems with those changes. For clarity, I pushed the branch I tested here: https://github.com/naughtont3/ompi/tree/pre-NoRTE-plus-oscrdma-fix

Things now fail later, but there is still a failure during the detach: a SEGV inside ompi_osc_rdma_remove_attachment(). A debug log with osc_base_verbose and osc_rdma_verbose enabled is attached.

naughtont3 commented 4 years ago

I noticed that I had OMP threads=8, so I re-ran with OMP_NUM_THREADS=1 for a slightly simpler case. It still fails to find a memory attachment and throws the error during detach.

naughtont3 commented 4 years ago

It seems like the rdma_region_handle in ompi_osc_rdma_detach() is NULL. I am not sure why. When that happens and you see "could not find dynamic memory attachment", it returns OMPI_ERR_BASE, which triggers the MPI window error handler for errors-are-fatal. So the question is why/where the rdma_region_handle gets lost.

hjelmn commented 4 years ago

Think I found the issue. Please try it again.

naughtont3 commented 4 years ago

This seems better for my test. But I will have to check with Dmitry to ensure it behaves as expected for him.

DmitryLyakh commented 4 years ago

I am unable to test this as I cannot find which branch/commit I am supposed to try. The last commit I see in Nathan's repository on branch osc_rdma_allow_overlapping_registration_regions_and_return_the_correct_error_code_when_regions_overlap is dated Feb 11. Where is the latest commit with today's fix?

DmitryLyakh commented 4 years ago

This is what I see as the latest commit (Feb 11):

commit 96ed63026033271a1d253776e3fbc8a573f91d10
Author: Nathan Hjelm <hjelmn@google.com>
Date:   Tue Feb 11 21:57:24 2020 -0800

osc/rdma: modify attach to check for region overlap

This commit addresses two issues in osc/rdma:

 1) It is erroneous to attach regions that overlap. This was being
    allowed but the standard does not allow overlapping attachments.

 2) Overlapping registration regions (4k alignment of attachments)
    appear to be allowed. Add attachment bases to the bookkeeping
    structure so we can keep better track of what can be detached.

It is possible that the standard did not intend to allow #2. If that
is the case then #2 should fail in the same way as #1. There should
be no technical reason to disallow #2 at this time.

References #7384

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
hjelmn commented 4 years ago

Force-pushed the branch. Just re-clone it. You will probably need to set the max attach to 256 or higher for your app.

DmitryLyakh commented 4 years ago

I re-cloned https://github.com/hjelmn/ompi.git, checked out branch osc_rdma_allow_overlapping_registration_regions_and_return_the_correct_error_code_when_regions_overlap, commit 96ed63026033271a1d253776e3fbc8a573f91d10. And it produces exactly the same crash in MPI_Win_detach as before (below) on my Ubuntu 16.04 laptop. Any ideas?

Printing scalar etens ... etens()[] 0.10668566847503D+15 Ok: 0.1063 sec
Retrieving directly scalar etens ... Ok: Value = ( 0.10668566847503D+15 0.00000000000000D+00): 0.0560 sec
Retrieving directly tensor dtens ... Ok: Norm = 0.10668566847502D+15: 0.3408 sec
Destroying tensor rtens ...
[Incredible:00000] An error occurred in MPI_Win_detach
[Incredible:00000] reported by process [3374317570,1]
[Incredible:00000] on win rdma window 5
[Incredible:00000] MPI_ERR_UNKNOWN: unknown error
[Incredible:00000] MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[Incredible:00000]    and potentially your MPI job)
[Incredible:00000] An error occurred in MPI_Win_detach
[Incredible:00000] reported by process [3374317570,0]
[Incredible:00000] on win rdma window 5
[Incredible:00000] MPI_ERR_UNKNOWN: unknown error
[Incredible:00000] MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[Incredible:00000]    and potentially your MPI job)
[Incredible:23436] PRUN: EVHANDLER WITH STATUS PMIX_ERR_JOB_TERMINATED(-145)
[Incredible:23436] JOB [51488,2] COMPLETED WITH STATUS -1
[Incredible:23436] PRUN: INFOCB
[Incredible:23436] PRUN: EVHANDLER WITH STATUS LOST_CONNECTION_TO_SERVER(-101)

DmitryLyakh commented 4 years ago

I tried both the 128 and 256 max limits, with no difference ... However, Thomas somehow made it work on his desktop. I will try on Summit later as well, but on my laptop I observe no difference; it always crashes the same way as originally reported ...

DmitryLyakh commented 4 years ago

This is the mpiexec command I used on my laptop:

    /usr/local/mpi/openmpi/openmpi-hjelmn/bin/mpiexec -np 4 -npernode 4 --hostfile hostfile --verbose --mca mpi_abort_print_stack 1 --mca osc_rdma_max_attach 256 ./exec.sh

hostfile:

    localhost slots=4

naughtont3 commented 4 years ago

Yes, I think that is the same scenario, because I see MPI_ERR_UNKNOWN; whereas on my desktop, if I exclude --mca osc_rdma_max_attach 128, it fails during MPI_Win_attach with error MPI_ERR_RMA_ATTACH.

hjelmn commented 4 years ago

Well, I can no longer reproduce the issue. I committed the changes to master so you can go ahead and try that and see what you get.

DmitryLyakh commented 4 years ago

Do you mean that the test code I provided runs to completion without a crash in MPI_Win_detach() in your case? I built and tested the latest commit (below) from github.com/hjelmn/ompi and am still getting the same crash on my Ubuntu 16.04 machine ...

commit 54c8233f4f670ee43e59d95316b8dc68f8258ba0
Author: Nathan Hjelm <hjelmn@google.com>
Date:   Sun Feb 16 17:09:20 2020 -0800

osc/rdma: bump the default max dynamic attachments to 64

This commit increases the osc_rdma_max_attach variable from 32
to 64. The new default is kept low due to the small number
of registration resources on some systems (Cray Aries). A
larger max attachment value can be set by the user on other
systems.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
DmitryLyakh commented 4 years ago

Thomas, does the latest commit pass the test on your desktop?

naughtont3 commented 4 years ago

Yes on my desktop (not tested on Summit yet).

I pulled OMPI master with Nathan's changes merged and rebuilt on my desktop.

     beaker:$ gcc --version | head -1
     gcc (Spack GCC) 8.1.0

     beaker:$  ../configure \
            --enable-mpirun-prefix-by-default \
            --enable-debug \
            --prefix=$PWD/_install \
         && make \
         && make install
    beaker:$ git remote -v
    origin  https://gitlab.com/DmitryLyakh/ExaTensor.git (fetch)
    origin  https://gitlab.com/DmitryLyakh/ExaTensor.git (push)
    beaker:$ git br
    master
    * openmpi_fail
    beaker:$ git log --oneline | head -2
    bf8a46e Prepared the Makefile for reproducing the OpenMPI crash in MPI_Win_detach().
    68e2a37 Added ddss_flush_all() in DDSS.

    mpirun \
        --np 4 \
        --mca osc rdma \
        --mca mpi_abort_print_stack 1 \
        --mca osc_rdma_max_attach 128 \
        -x OMP_NUM_THREADS=8 \
        -x QF_NUM_PROCS=4 \
        -x QF_PROCS_PER_NODE=4 \
        -x QF_CORES_PER_PROCESS=1 \
        -x QF_MEM_PER_PROCESS=1024 \
        -x QF_NVMEM_PER_PROCESS=0 \
        -x QF_HOST_BUFFER_SIZE=1024 \
        -x QF_GPUS_PER_PROCESS=0 \
        -x QF_MICS_PER_PROCESS=0 \
        -x QF_AMDS_PER_PROCESS=0 \
        -x QF_NUM_THREADS=1 \
        ./Qforce.x
hjelmn commented 4 years ago

Looks to me like it runs to completion. I am running with Open MPI master, with only the osc_rdma_max_attach MCA variable set to 1024 (just to be safe).

DmitryLyakh commented 4 years ago

Just in case, did you build OpenMPI in Debug or Release mode?

hjelmn commented 4 years ago

debug mode. shouldn't make a difference but I can try again in optimized mode.

hjelmn commented 4 years ago

shouldn't but is. huh

hjelmn commented 4 years ago

ok, I see the issue. We had the call inside an assert, and asserts are compiled out in non-debug builds. Fixing.
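An illustrative example of that pattern (not the actual osc/rdma code; the helper name is made up):

    #include <assert.h>

    int register_region(void);          /* hypothetical call with a side effect */

    void broken(void)
    {
        /* Debug build: register_region() runs and its result is checked.
         * Optimized build (-DNDEBUG): the whole expression is compiled out,
         * so the registration never happens. */
        assert(register_region() == 0);
    }

    void fixed(void)
    {
        int rc = register_region();     /* the side effect always executes */
        assert(rc == 0);                /* only the check disappears with NDEBUG */
        (void) rc;                      /* avoid an unused-variable warning */
    }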

naughtont3 commented 4 years ago

See also PR #7421

naughtont3 commented 4 years ago

@DmitryLyakh I think I worked through most of the issues I was hitting on Summit. I am now able to run your reproducer without error on Summit using the "DEVELOP" openmpi/master build. This is using the gcc/8.1.1 module and ompi master at 960c5f7. Please give this a shot and see how things work for you.

DmitryLyakh commented 4 years ago

Confirmed on my desktop: the OpenMPI master branch works fine in Release mode after PR #7421.

DmitryLyakh commented 4 years ago

Thanks for fixing this! I guess this issue can be closed now.

hppritcha commented 4 years ago

@naughtont3 could you boil this down to a small reproducer we can put into the ibm test suite?

hjelmn commented 4 years ago

I can add one. Just need to attach and detach a bunch.
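A rough sketch of what such a test could look like (this is not the test that was actually added; the attach count and region sizes are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    #define NREG 128

    int main(int argc, char **argv)
    {
        static char buf[NREG * 64];   /* many small regions packed into a few pages */
        MPI_Win win;
        int i;

        MPI_Init(&argc, &argv);
        MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* NREG exceeds the default attach limit, so something like
         * mpirun --mca osc rdma --mca osc_rdma_max_attach 256 may be needed. */
        for (i = 0; i < NREG; ++i)
            MPI_Win_attach(win, buf + i * 64, 64);

        for (i = NREG - 1; i >= 0; --i)
            MPI_Win_detach(win, buf + i * 64);

        MPI_Win_free(&win);
        MPI_Finalize();
        printf("attach/detach cycle completed\n");
        return 0;
    }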

naughtont3 commented 4 years ago

@hppritcha OK, I'll sync with @DmitryLyakh and get it into a test batch.

naughtont3 commented 4 years ago

@hjelmn if you can write a simple unit test, that would be easier than having the full application case. I'll try to get a version of @DmitryLyakh's code into a test somewhere, but your unit test would be an easier case for most folks to quickly test. Thx

jsquyres commented 4 years ago

@naughtont3 @hjelmn Where are we on this issue? Did @cniethammer's cherry-pick fix the issue on v4.0.x and/or v4.1.x?

hppritcha commented 3 years ago

@naughtont3 could you check if this is still a problem on master and the release branches?

hppritcha commented 3 years ago

closing. reopen if this is still a problem.