open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Error with combination of MPI_Win_create_dynamic() and MPI_Get() using osc rdma #10328

Open · jotabf opened this issue 2 years ago

jotabf commented 2 years ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

$ ompi_info
               Package: Open MPI haroldo@headnode0 Distribution
                Open MPI: 4.1.1
  Open MPI repo revision: v4.1.1
   Open MPI release date: Apr 24, 2021
                Open RTE: 4.1.1
  Open RTE repo revision: v4.1.1
   Open RTE release date: Apr 24, 2021
                    OPAL: 4.1.1
      OPAL repo revision: v4.1.1
       OPAL release date: Apr 24, 2021
                 MPI API: 3.1.0
            Ident string: 4.1.1
                  Prefix: /opt/npad/shared/libraries/openmpi/4.1.1-gnu-8
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: headnode0
           Configured by: haroldo
           Configured on: Wed Dec  8 01:32:00 UTC 2021
          Configure host: headnode0
  Configure command line: '--prefix=/opt/npad/shared/libraries/openmpi/4.1.1-gnu-8'
                          '--with-slurm' '--with-pmi' '--with-verbs'
                          '--with-ucx' '--enable-openib-rdmacm-ibaddr'
                Built by: haroldo
                Built on: Wed Dec  8 01:37:44 UTC 2021
              Built host: headnode0
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the gfortran compiler and/or Open
                          MPI, does not support the following: array
                          subsections, direct passthru (where possible) to
                          underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: 8.5.0
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /usr/bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: no
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no

Please describe the system on which you are running


Details of the problem

Hello everyone,

I'm trying to run an MPI code with MPI_Get() using a dynamic window (MPI_Win_create_dynamic() + MPI_Win_attach()). I have two scenarios. In the first, I run my code with the following configuration:

btl_openib_allow_ib = 1
btl_openib_if_include = mlx4_0:1
osc = rdma 
orte_base_help_aggregate = 0 
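
For reference, these are in MCA parameter-file syntax (e.g. $HOME/.openmpi/mca-params.conf); the same settings can be passed on the mpirun command line, roughly like this (./test.o stands in for my binary):

mpirun -n 2 --mca osc rdma \
       --mca btl_openib_allow_ib 1 \
       --mca btl_openib_if_include mlx4_0:1 \
       --mca orte_base_help_aggregate 0 \
       ./test.o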

With this configuration, the execution gets stuck in MPI_Get(), sometimes generating the following error:

[service3:92500] *** An error occurred in MPI_Rget
[service3:92500] *** reported by process [415891457,0]
[service3:92500] *** on win rdma window 4
[service3:92500] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[service3:92500] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[service3:92500] ***    and potentially your MPI job)

I can work around this by changing to osc = ucx. The code then runs fine, but that brings me back to another problem, which I have described here (https://github.com/open-mpi/ompi/issues/9580).

So my goal is to solve the MPI_Get() problem while keeping osc = rdma. I'm using the following code to test MPI_Get():

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *value = malloc(sizeof(int));
    int get_value;
    MPI_Aint *addr;
    MPI_Aint get_addr;
    MPI_Win win_addr;
    MPI_Win win_value;

    /* win_addr is a regular allocated window used to publish the address;
       win_value is a dynamic window to which the target buffer is attached. */
    MPI_Win_allocate(sizeof(MPI_Aint), sizeof(MPI_Aint), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &addr, &win_addr);
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win_value);
    MPI_Win_attach(win_value, value, sizeof(int));

    if (rank == 0) {
        /* Rank 0 sets the value and publishes its absolute address. */
        MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win_value);
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank, 0, win_addr);
        *value = 3366;
        MPI_Get_address(value, addr);
        printf("ID %i Dynamic Window with Value = %i and Addr %lu\n",
               rank, *value, (unsigned long)*addr);
        MPI_Win_unlock(rank, win_addr);
        MPI_Win_unlock(rank, win_value);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank != 0) {
        /* First fetch the published address from rank 0 ... */
        printf("ID %i MPI_Get Addr ...\n", rank);
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win_addr);
        MPI_Get(&get_addr, 1, MPI_AINT, 0, 0, 1, MPI_AINT, win_addr);
        MPI_Win_unlock(0, win_addr);

        /* ... then use that address as the displacement into the
           dynamic window. */
        printf("ID %i MPI_Get Value ...\n", rank);
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win_value);
        MPI_Get(&get_value, 1, MPI_INT, 0, get_addr, 1, MPI_INT, win_value);
        MPI_Win_unlock(0, win_value);

        printf("ID %i MPI_Get completed on Dynamic Window with Value = %i and Addr %lu\n",
               rank, get_value, (unsigned long)get_addr);
    }

    MPI_Win_free(&win_addr);
    MPI_Win_detach(win_value, value);
    MPI_Win_free(&win_value);
    free(value);
    MPI_Finalize();

    return EXIT_SUCCESS;
}

I'm using a combination of MPI_Win_create_dynamic() and MPI_Get(), where the target memory address is passed as the displacement parameter of MPI_Get(), as the MPI standard specifies for dynamic windows. Again, this works with ucx but not with rdma. I suspect the rdma implementation of MPI_Get() does not handle the displacement correctly when it represents an absolute target address.
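
For completeness, the reproducer can be built and run roughly like this (test.o matches the binary name in the mpirun command quoted further below):

mpicc test.c -o test.o
mpirun -n 2 --mca osc rdma --mca btl_openib_allow_ib 1 ./test.o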

benmenadue commented 2 years ago

A user on our cluster has hit a very similar issue with v4.1.3, but using MPI_Win_allocate and MPI_Fetch_and_op, and only when the two ranks have different-sized windows (every rank except rank 0 has a zero-sized window in their example). As here, it fails with osc/rdma but works with osc/ucx.

Running with an --enable-debug build gave the attached mpi_fetch_test.log (the mpirun command is in that log file), but the key line I found was this:

[1,1]<stderr>:[gadi-login-07.gadi.nci.org.au:1091310] remote address range 0x7f2558006008 - 0x7f255800600c is out of range. Valid address range is 0x7f2558006008 - 0x7f2558006008 (0 bytes)

I.e., rank 1 thinks that the window it's getting from (i.e., on rank 0) is 0 bytes, even though it's non-zero. My thought was that it was using the local window's size rather than the remote size, but I wasn't able to see how that might come about.

I've attached a test program (mpi_fetch_test.txt; rename it to .f90 first, since GitHub doesn't like that extension) that shows the failure. It can be compiled and run with a simple

mpif90 mpi_fetch_test.f90
mpirun -np 2 ./a.out
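
Since the Fortran attachment isn't inlined above, here is a rough C sketch of the pattern described (hypothetical code, not the attached test): a window that is zero-sized on every rank except 0, followed by MPI_Fetch_and_op targeting rank 0.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only rank 0 exposes memory; every other rank allocates a
       zero-sized window, matching the failing scenario. */
    MPI_Aint size = (rank == 0) ? (MPI_Aint)sizeof(long) : 0;
    long *base;
    MPI_Win win;
    MPI_Win_allocate(size, sizeof(long), MPI_INFO_NULL, MPI_COMM_WORLD,
                     &base, &win);

    if (rank == 0)
        *base = 0;                    /* initialize the shared counter */
    MPI_Barrier(MPI_COMM_WORLD);      /* make the initialization visible */

    /* Every rank atomically increments the counter on rank 0. */
    long one = 1, old = -1;
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Fetch_and_op(&one, &old, MPI_LONG, 0, 0, MPI_SUM, win);
    MPI_Win_unlock(0, win);

    printf("rank %d fetched %ld\n", rank, old);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}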
jotabf commented 2 years ago

I ran the same code with an --enable-debug build of Open MPI, and it produced an infinite loop with this output:

mpirun -n 2 -mca osc rdma -mca btl_openib_allow_ib 1 --mca osc_base_verbose 100 ./test.o

[...]
[r1i3n3:216755] releasing shared lock 303cc70 on peer 0. value 0xffffffffffffffff
[r1i3n3:216755] allocating frag. pending = 1
[r1i3n3:216755] allocating frag. pending = 2
[r1i3n3:216755] pending atomic 0x25d0c80 complete with status 0
[r1i3n3:216755] returning frag. pending = 3
[r1i3n3:216755] pending atomic 0x2580e80 complete with status 0
[r1i3n3:216755] returning frag. pending = 2
[r1i3n3:216755] shared lock incremented. old value 0x8000000000000000
[r1i3n3:216755] another peer has exclusive access to lock
[r1i3n3:216755] releasing shared lock 303cc70 on peer 0. value 0xffffffffffffffff
[r1i3n3:216755] allocating frag. pending = 1
[r1i3n3:216755] allocating frag. pending = 2
[r1i3n3:216755] pending atomic 0x2580e80 complete with status 0
[r1i3n3:216755] returning frag. pending = 3
[r1i3n3:216755] pending atomic 0x25d0c80 complete with status 0
[r1i3n3:216755] returning frag. pending = 2
[r1i3n3:216755] shared lock incremented. old value 0x8000000000000000
[r1i3n3:216755] another peer has exclusive access to lock
[...]

complete_output.txt

hjelmn commented 2 years ago

Never mind, I see what you are doing there. That should be valid.

hjelmn commented 2 years ago

Will dig into this tomorrow and see what is going on. I don't think much has changed in the dynamic window code in years, and it was working.

hjelmn commented 2 years ago

Can you try with 5.0.0 or the main branch? I wonder if some fix did not make it back to 4.1.x.
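
For reference, building from the main branch is roughly (a sketch; the install prefix is a placeholder):

git clone https://github.com/open-mpi/ompi.git
cd ompi
git submodule update --init --recursive
./autogen.pl
./configure --prefix=$HOME/opt/ompi-main
make -j install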

devreal commented 2 years ago

I'm seeing similar errors with osc/rdma in the ARMCI test suite. Interestingly, I only see them on shared memory and not when running with one process per node. Will have to investigate.

devreal commented 2 years ago

@hjelmn Did you find anything useful?

devreal commented 2 years ago

Funny thing: I cannot reproduce this issue with the test case provided at the top. Will have to use the ARMCI tests...

devreal commented 2 years ago

@benmenadue I see a similar problem with windows allocated through MPI_Win_allocate and filed https://github.com/open-mpi/ompi/issues/10521. In short, using hcoll causes a problem on the machine I'm using and leads osc/rdma to believe that all processes allocate the same window size when they don't. One workaround is to disable hcoll by passing --mca coll ^hcoll to mpirun, as shown below. Any chance you could give that a try?
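
For example (./a.out standing in for the actual test binary):

mpirun -np 2 --mca coll ^hcoll ./a.out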

devreal commented 2 years ago

@jotabf I think I can reproduce a hang in MPI_Get with osc/rdma and btl/openib. I have not had any success in debugging it, though. However, support for openib was deprecated in the 4.x series and removed in the 5.x release. The replacement on IB systems is osc/ucx, which should be selected by default in 5.0.x on your system and works for me.

jotabf commented 2 years ago

@devreal I mentioned it there, but not here: PR #10413 solves my problem. For now, I'm running a locally modified version of the code while waiting for the change to be incorporated into an official release.

Yes, I was using osc/rdma and btl/openib when I had this problem. I still have a problem in my code with osc/ucx involving some unexpected synchronization, so I have avoided using it. Is it not possible to use osc/rdma in 5.x?

devreal commented 2 years ago

I still have a problem in my code using osc/ucx with some unexpected synchronization.

Do you have the details for that? Could you open a ticket (if there isn't one already)?

Because of this, I have avoided using it. Is it not possible to use osc/rdma in 5.x?

Technically yes, using either btl/tcp (likely slow, since there is no RDMA) or btl/uct (currently broken, see https://github.com/open-mpi/ompi/issues/10522). The implementation officially supported by Mellanox is osc/ucx.
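
For example, forcing osc/rdma over TCP might look like this (a sketch; ./a.out is a placeholder):

mpirun -np 2 --mca osc rdma --mca btl tcp,self ./a.out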

jotabf commented 2 years ago

I saw the problem in issue https://github.com/open-mpi/ompi/issues/9580. Initially my problem was a different one, but starting from https://github.com/open-mpi/ompi/issues/9580#issuecomment-954090542 I realized that my issue was with osc/ucx. If that is not clear, I can try to reproduce the problem and open a new issue. However, my understanding was that the issue concerned an atomic operation that still does not have a solution.