open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

osc/rdma: segfault in MPI_Compare_and_swap with flat MPI #9146

Open · s417-lama opened this issue 3 years ago

s417-lama commented 3 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

The current master branch: 65ca64f34e486b32be986f28356f8b0d0e3539ac

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a git clone, as follows:

$ git clone https://github.com/open-mpi/ompi.git
$ cd ompi/
$ git submodule update --init --recursive
$ ./autogen.pl
$ mkdir build
$ cd build/
$ ../configure --prefix=<install_path> --with-ucx=<path_to_ucx> --disable-man-pages
$ make -j
$ make install

UCX v1.10.1 was built from a tarball.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 256b1f5dec15386990b57c7fc4c7ecd67a6f1e27 3rd-party/openpmix (v1.1.3-3014-g256b1f5)
 53e80245ad007550aee18c3fd176e030a173a16b 3rd-party/prrte (dev-31257-g53e8024)

Please describe the system on which you are running


Details of the problem

When MPI_Compare_and_swap() is called in the "flat MPI" model, i.e., with multiple processes running on each of multiple nodes, it causes a segfault with the rdma osc component.

The segfault did not occur on a single node, nor with multiple nodes running one process per node.

Minimal code example to reproduce the segfault:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  uint64_t* lock;

  MPI_Win win;
  MPI_Win_allocate(sizeof(uint64_t), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &lock, &win);
  MPI_Win_lock_all(0, win);

  *lock = 0;

  MPI_Barrier(MPI_COMM_WORLD);

  const uint64_t one = 1;
  const uint64_t zero = 0;
  uint64_t result;
  MPI_Compare_and_swap(&one, &zero, &result, MPI_UINT64_T, 0, 0, win);
  MPI_Win_flush(0, win);

  printf("%lu\n", (unsigned long)result);  /* result is uint64_t; avoid the signed %ld specifier */

  MPI_Barrier(MPI_COMM_WORLD);

  MPI_Win_unlock_all(win);
  MPI_Finalize();
  return 0;
}

This program first initializes lock to 0, and then all processes issue MPI_Compare_and_swap() on lock at rank 0. The expected behavior is that exactly one process gets result = 0.

Running the above program with 4 processes on 2 nodes (2 processes per node):

$ mpirun --mca osc rdma -n 4 -N 2 ./a.out

Output:

[cx0001:24799:0:24799] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[cx0001:24800:0:24800] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
==== backtrace (tid:  24800) ====
 0 0x00000000000587b5 ucs_debug_print_backtrace()  <HOME>/ucx-1.10.1/build/src/ucs/../../../src/ucs/debug/debug.c:656
 1 0x00000000000b9e05 mca_btl_ofi_afop()  ???:0
 2 0x000000000023f176 ompi_osc_rdma_lock_all_atomic()  ???:0
 3 0x00000000000f81c6 MPI_Win_lock_all()  ???:0
 4 0x00000000004009f1 main()  test_cas.c:13
 5 0x00000000000223d5 __libc_start_main()  ???:0
 6 0x00000000004008e9 _start()  ???:0
=================================

a.out:24800 terminated with signal 11 at PC=2b59ed535e05 SP=7fffcfd3bb00.  Backtrace:
<ompi_install_path>/lib/libopen-pal.so.0(mca_btl_ofi_afop+0x105)[0x2b59ed535e05]
<ompi_install_path>/lib/libmpi.so.0(ompi_osc_rdma_lock_all_atomic+0x326)[0x2b59ecb77176]
<ompi_install_path>/lib/libmpi.so.0(PMPI_Win_lock_all+0x96)[0x2b59eca301c6]
./a.out[0x4009f1]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b59ed0d13d5]
./a.out[0x4008e9]
==== backtrace (tid:  24799) ====
 0 0x00000000000587b5 ucs_debug_print_backtrace()  <HOME>/ucx-1.10.1/build/src/ucs/../../../src/ucs/debug/debug.c:656
 1 0x00000000000b9e05 mca_btl_ofi_afop()  ???:0
 2 0x000000000023f176 ompi_osc_rdma_lock_all_atomic()  ???:0
 3 0x00000000000f81c6 MPI_Win_lock_all()  ???:0
 4 0x00000000004009f1 main()  test_cas.c:13
 5 0x00000000000223d5 __libc_start_main()  ???:0
 6 0x00000000004008e9 _start()  ???:0
=================================

a.out:24799 terminated with signal 11 at PC=2b3c2c582e05 SP=7ffe75a10190.  Backtrace:
<ompi_install_path>/lib/libopen-pal.so.0(mca_btl_ofi_afop+0x105)[0x2b3c2c582e05]
<ompi_install_path>/lib/libmpi.so.0(ompi_osc_rdma_lock_all_atomic+0x326)[0x2b3c2bbc4176]
<ompi_install_path>/lib/libmpi.so.0(PMPI_Win_lock_all+0x96)[0x2b3c2ba7d1c6]
./a.out[0x4009f1]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b3c2c11e3d5]
./a.out[0x4008e9]

Running with -n 4 -N 1 (one process per node) and -n 4 -N 4 (all four processes on a single node) did not cause a segfault.
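For clarity, the three placements reported above side by side (environment-specific command lines, assuming the same binary and options as the reproducer):

```shell
# Segfaults: 4 processes, 2 per node, across 2 nodes ("flat MPI")
mpirun --mca osc rdma -n 4 -N 2 ./a.out

# Works: 4 processes, one per node
mpirun --mca osc rdma -n 4 -N 1 ./a.out

# Works: all 4 processes on a single node
mpirun --mca osc rdma -n 4 -N 4 ./a.out
```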

devreal commented 3 years ago

The problem seems to be in MPI_Win_lock_all, not MPI_Compare_and_swap. Just out of curiosity: why are you building against UCX and then using osc/rdma? Does it work when running with --mca osc ucx?

s417-lama commented 3 years ago

You're right. I got confused because I was in the middle of investigating a segfault in MPI_Compare_and_swap() with another version of Open MPI (v4.1.1). It seems the segfault in MPI_Compare_and_swap() is resolved in the latest version, but another issue arose in MPI_Win_lock_all.

why are you building against UCX and then use osc/rdma?

This is because I wanted to compare their performance.

Does it work when running with --mca osc ucx?

It did work with --mca osc ucx, but not with --mca osc rdma.
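The osc component comparison above, as command lines (environment-specific; assumes the same 2-node, 2-processes-per-node setup):

```shell
# Works: UCX one-sided component
mpirun --mca osc ucx -n 4 -N 2 ./a.out

# Segfaults in MPI_Win_lock_all: rdma one-sided component
mpirun --mca osc rdma -n 4 -N 2 ./a.out
```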

hjelmn commented 3 years ago

Hmmm, btl/ofi was used. I will work on ensuring that btl/uct is selected when osc/rdma is used.

devreal commented 3 years ago

@hjelmn Shouldn't btl/ofi be the btl to use on Omni-Path systems? To enable btl/uct, I have to run with --mca btl_uct_memory_domains all, as otherwise btl/uct bails out.
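A sketch of the workaround described above, assuming the same reproducer command line; the btl_uct_memory_domains parameter comes from the comment, the rest is an assumption:

```shell
# Force osc/rdma while letting btl/uct register all memory domains,
# since btl/uct otherwise bails out of component selection.
mpirun --mca osc rdma --mca btl_uct_memory_domains all -n 4 -N 2 ./a.out
```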