open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

v4.1.x memory alignment issue on AVX support #7954

Closed: zhngaj closed this issue 4 years ago

zhngaj commented 4 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.x branch

git clone --recursive https://github.com/open-mpi/ompi.git
git checkout v4.1.x
bd16024a0b (HEAD -> v4.1.x, origin/v4.1.x) Merge pull request #7946 from gpaulsen/topic/v4.1.x/README_for_SLURM_binding

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

./configure --prefix=/fsx/ompi/install --with-libfabric=/fsx/libfabric/install --enable-debug

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

There's no output from git submodule status after I checked out the v4.1.x branch. I can see output with the master branch, though:

[ec2-user@ip-172-31-42-103 ompi]$ git submodule status
 4a43c39c89037f52b4e25927e58caf08f3707c33 opal/mca/hwloc/hwloc2/hwloc (hwloc-2.1.0rc2-53-g4a43c39c)
 b18f8ff3f62bd93f17e01c4230e7eca4a30aa204 opal/mca/pmix/pmix4x/openpmix (v1.1.3-2460-gb18f8ff3)
 17ace07e7d41f50d035dc42dfd6233f8802e4405 prrte (dev-30660-g17ace07e7d)

Please describe the system on which you are running


Details of the problem

  1. Built libfabric master (b01932dfb (HEAD -> master, origin/master, origin/HEAD) Merge pull request #6103 from wzamazon/efa_fix_readmsg)

    ./configure --prefix=/fsx/libfabric/install --enable-mrail --enable-tcp --enable-rxm --disable-rxd --disable-verbs --enable-efa=/usr --enable-debug
  2. Built Open MPI v4.1.x branch (bd16024a0b (HEAD -> v4.1.x, origin/v4.1.x) Merge pull request #7946 from gpaulsen/topic/v4.1.x/README_for_SLURM_binding) with libfabric

    ./configure --prefix=/fsx/ompi/install --with-libfabric=/fsx/libfabric/install --enable-debug
  3. Built intel-mpi-benchmark 2019 U6 (https://github.com/intel/mpi-benchmarks/tree/IMB-v2019.6) with Open MPI

  4. Ran the IMB-EXT Accumulate test with 2 MPI processes and hit a segfault.

    
    [ec2-user@ip-172-31-9-184 ~]$ mpirun --prefix /fsx/ompi/install  -n 2 -N 1 --mca btl ofi --mca osc rdma --mca btl_ofi_provider_include efa --hostfile /fsx/hosts -x PATH -x LD_LIBRARY_PATH /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT Accumulate  -npmin 2 -iter 200
    Warning: Permanently added 'ip-172-31-13-230,172.31.13.230' (ECDSA) to the list of known hosts.
    #------------------------------------------------------------
    #    Intel(R) MPI Benchmarks 2019 Update 6, MPI-2 part
    #------------------------------------------------------------
    # Date                  : Tue Jul 21 19:06:54 2020
    # Machine               : x86_64
    # System                : Linux
    # Release               : 4.14.165-103.209.amzn1.x86_64
    # Version               : #1 SMP Sun Feb 9 00:23:26 UTC 2020
    # MPI Version           : 3.1
    # MPI Thread Environment:

    # Calling sequence was:

    # /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT Accumulate -npmin 2 -iter 200

    # Minimum message length in bytes:   0
    # Maximum message length in bytes:   4194304
    #
    # MPI_Datatype                   :   MPI_BYTE
    # MPI_Datatype for reductions    :   MPI_FLOAT
    # MPI_Op                         :   MPI_SUM
    #
    #

    # List of Benchmarks to run:

    # Accumulate

    #-----------------------------------------------------------------------------
    # Benchmarking Accumulate
    # #processes = 2
    #-----------------------------------------------------------------------------
    #
    #    MODE: AGGREGATE
    #
           #bytes #repetitions      t_min[usec]      t_max[usec]      t_avg[usec]      defects
                0          200         0.04         0.22         0.13         0.00
                4          200         0.85         0.89         0.87         0.00
                8          200         0.77         0.83         0.80         0.00
               16          200         0.81         0.85         0.83         0.00

    [ip-172-31-13-230:12036] *** Process received signal ***
    [ip-172-31-13-230:12036] Signal: Segmentation fault (11)
    [ip-172-31-13-230:12036] Signal code: (-6)
    [ip-172-31-13-230:12036] Failing at address: 0x1f400002f04
    [ip-172-31-13-230:12036] [ 0] /lib64/libpthread.so.0(+0xf600)[0x7f4b6ba25600]
    [ip-172-31-13-230:12036] [ 1] /fsx/ompi/install/lib/openmpi/mca_op_avx.so(+0x3e738)[0x7f4b60f23738]
    [ip-172-31-13-230:12036] [ 2] /fsx/ompi/install/lib/openmpi/mca_osc_rdma.so(+0x9f5d)[0x7f4b51bd1f5d]
    [ip-172-31-13-230:12036] [ 3] /fsx/ompi/install/lib/openmpi/mca_osc_rdma.so(+0xc637)[0x7f4b51bd4637]
    [ip-172-31-13-230:12036] [ 4] /fsx/ompi/install/lib/openmpi/mca_osc_rdma.so(+0xca19)[0x7f4b51bd4a19]
    [ip-172-31-13-230:12036] [ 5] /fsx/ompi/install/lib/openmpi/mca_osc_rdma.so(+0xffbf)[0x7f4b51bd7fbf]
    [ip-172-31-13-230:12036] [ 6] /fsx/ompi/install/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_accumulate+0x10f)[0x7f4b51bd86ed]
    [ip-172-31-13-230:12036] [ 7] /fsx/ompi/install/lib/libmpi.so.40(PMPI_Accumulate+0x421)[0x7f4b6c5a8967]
    [ip-172-31-13-230:12036] [ 8] /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT[0x43da5c]
    [ip-172-31-13-230:12036] [ 9] /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT[0x42c3aa]
    [ip-172-31-13-230:12036] [10] /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT[0x432d44]
    [ip-172-31-13-230:12036] [11] /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT[0x405513]
    [ip-172-31-13-230:12036] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4b6b66a575]
    [ip-172-31-13-230:12036] [13] /fsx/SubspaceBenchmarks/spack/opt/spack/linux-amzn2018-x86_64/gcc-4.8.5/intel-mpi-benchmarks-2019.6-tnzpd3z7s4mgsvchsl2ofn2cgw3aonty/bin/IMB-EXT[0x403d29]
    [ip-172-31-13-230:12036] *** End of error message ***

    Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

    mpirun noticed that process rank 1 with PID 12036 on node ip-172-31-13-230 exited on signal 11 (Segmentation fault).


  5. Some notes

Checking the backtrace below, I found that the segfault happens because Open MPI added support for MPI_Op reductions using AVX512, AVX2, and MMX ([commit](https://github.com/open-mpi/ompi/commit/b4e04bbd8a5eafe6cd6854884025460253a50962)). When it uses _mm256_load_ps (or _mm512_load_ps) to load single-precision floating-point elements, the address must be 32-byte (or 64-byte) aligned.

IMB-EXT uses MPI_Alloc_mem and MPI_Win_create to allocate the source buffer and the window, and the returned address is not guaranteed to have that alignment, which leads to the segfault.
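As a quick check (my own snippet, not part of IMB), the alignment MPI_Alloc_mem actually returns can be printed directly; nothing in the MPI standard promises 32- or 64-byte alignment:

    /* Sketch: inspect the 32-byte alignment of an MPI_Alloc_mem buffer. */
    #include <mpi.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        void *buf = NULL;
        MPI_Init(&argc, &argv);
        MPI_Alloc_mem(4 * 1024 * 1024, MPI_INFO_NULL, &buf);
        /* Any non-zero remainder breaks _mm256_load_ps / _mm512_load_ps. */
        printf("MPI_Alloc_mem address %% 32 = %lu\n",
               (unsigned long)((uintptr_t)buf % 32));
        MPI_Free_mem(buf);
        MPI_Finalize();
        return 0;
    }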

When I checked out the commit right before the one above, the segfault disappeared. I also tried _mm256_loadu_ps (or _mm512_loadu_ps), which does not require aligned memory, and the segfault disappeared as well.
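For illustration, a minimal standalone sketch (my own, not code from IMB or Open MPI) of the two load variants on a pointer that is only 16-byte aligned, matching the gdb finding below (0xef70d0 % 32 == 16):

    /* Sketch: aligned vs. unaligned AVX loads on a 16-byte-aligned pointer.
     * Build with e.g. gcc -std=c11 -mavx -O0. */
    #include <immintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* 64-byte-aligned block; base + 4 floats is only 16-byte aligned. */
        float *base = aligned_alloc(64, 16 * sizeof(float));
        float *p = base + 4;

        printf("p %% 32 = %lu\n", (unsigned long)((uintptr_t)p % 32)); /* prints 16 */

        __m256 v = _mm256_loadu_ps(p);      /* unaligned load: works */
        /* __m256 w = _mm256_load_ps(p); */ /* aligned load: faults, as in the IMB run */

        (void)v;
        free(base);
        return 0;
    }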

**Is this a real issue on the OMPI side, or are applications (IMB here) expected to provide pointers with the required alignment?**

    #0  0x00007f98385f8763 in _mm256_load_ps (__P=0xef70d0) at /usr/lib/gcc/x86_64-amazon-linux/7/include/avxintrin.h:873
    873       return *(__m256 *)__P;
    Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.180.amzn1.x86_64 libibverbs-28.amzn0-1.amzn1.x86_64 libnl3-3.2.28-4.6.amzn1.x86_64 libpciaccess-0.13.1-4.1.11.amzn1.x86_64 zlib-1.2.8-7.18.amzn1.x86_64
    (gdb) bt
    #0  0x00007f98385f8763 in _mm256_load_ps (__P=0xef70d0) at /usr/lib/gcc/x86_64-amazon-linux/7/include/avxintrin.h:873
    #1  ompi_op_avx_2buff_add_float_avx512 (_in=0xef70a0, _out=0xef70d0, count=0x7fff324b9b94, dtype=0x7fff324b9b88, module=0xbcfb30)
        at op_avx_functions.c:504
    #2  0x00007f983fb80a17 in ompi_op_reduce (op=0x661180 <ompi_mpi_op_sum>, source=0xef70a0, target=0xef70d0, count=8,
        dtype=0x662e00 <ompi_mpi_float>) at ../../../ompi/op/op.h:581
    #3  0x00007f983fb8114a in ompi_osc_base_sndrcv_op (origin=0xef70a0, origin_count=8, origin_dt=0x662e00 <ompi_mpi_float>,
        target=0xef70d0, target_count=8, target_dt=0x662e00 <ompi_mpi_float>, op=0x661180 <ompi_mpi_op_sum>)
        at base/osc_base_obj_convert.c:178
    #4  0x00007f9824fd4293 in ompi_osc_rdma_gacc_local (source_buffer=0xef70a0, source_count=8,
        source_datatype=0x662e00 <ompi_mpi_float>, result_buffer=0x0, result_count=0, result_datatype=0x0, peer=0xef7fa0,
        target_address=15691984, target_handle=0x0, target_count=8, target_datatype=0x662e00 <ompi_mpi_float>,
        op=0x661180 <ompi_mpi_op_sum>, module=0xee9110, request=0x0, lock_acquired=true) at osc_rdma_accumulate.c:143
    #5  0x00007f9824fd7f71 in ompi_osc_rdma_rget_accumulate_internal (sync=0xee9310, origin_addr=0xef70a0, origin_count=8,
        origin_datatype=0x662e00 <ompi_mpi_float>, result_addr=0x0, result_count=0, result_datatype=0x0, peer=0xef7fa0, target_rank=0,
        target_disp=0, target_count=8, target_datatype=0x662e00 <ompi_mpi_float>, op=0x661180 <ompi_mpi_op_sum>, request=0x0)
        at osc_rdma_accumulate.c:934
    #6  0x00007f9824fd86ed in ompi_osc_rdma_accumulate (origin_addr=0xef70a0, origin_count=8,
        origin_datatype=0x662e00 <ompi_mpi_float>, target_rank=0, target_disp=0, target_count=8,
        target_datatype=0x662e00 <ompi_mpi_float>, op=0x661180 <ompi_mpi_op_sum>, win=0xeebed0) at osc_rdma_accumulate.c:1067
    #7  0x00007f983fb2b967 in PMPI_Accumulate (origin_addr=0xef70a0, origin_count=8, origin_datatype=0x662e00 <ompi_mpi_float>,
        target_rank=0, target_disp=0, target_count=8, target_datatype=0x662e00 <ompi_mpi_float>, op=0x661180 <ompi_mpi_op_sum>,
        win=0xeebed0) at paccumulate.c:130
    #8  0x000000000043da5c in IMB_accumulate ()
    #9  0x000000000042c3aa in Bmark_descr::IMB_init_buffers_iter(comm_info*, iter_schedule*, Bench*, cmode*, int, int) ()
    #10 0x0000000000432d44 in OriginalBenchmark<BenchmarkSuite<(benchmark_suite_t)4>, &IMB_accumulate>::run(scope_item const&) ()
    #11 0x0000000000405513 in main ()
    (gdb) f 0
    #0  0x00007f98385f8763 in _mm256_load_ps (__P=0xef70d0) at /usr/lib/gcc/x86_64-amazon-linux/7/include/avxintrin.h:873
    873       return *(__m256 *)__P;
    (gdb) p 0xef70d0 % 32
    $2 = 16

bosilca commented 4 years ago

Thanks for the bug report and for the analysis. We should not have used the aligned access primitives for SSE. I have made a PR (#7957) and added a test case to validate unaligned memory accesses.
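For the gist of such a check (a rough sketch only, not the actual test added in #7957): feed the reduction deliberately misaligned buffers so any aligned-load assumption in the op module trips immediately.

    /* Sketch of an unaligned-reduction check (approximation, not the test in #7957):
     * run MPI_Reduce_local on float buffers intentionally offset so they are not
     * 32/64-byte aligned, exercising the AVX-backed MPI_SUM path. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        enum { N = 1024 };
        MPI_Init(&argc, &argv);

        /* 64-byte-aligned blocks; the +4 float offset puts the working
         * pointers 16 bytes past an alignment boundary. */
        float *in_raw  = aligned_alloc(64, (N + 16) * sizeof(float));
        float *out_raw = aligned_alloc(64, (N + 16) * sizeof(float));
        float *in  = in_raw  + 4;
        float *out = out_raw + 4;

        for (int i = 0; i < N; i++) { in[i] = 1.0f; out[i] = 2.0f; }

        /* With the aligned-load bug this segfaults; with the fix out[i] == 3.0f. */
        MPI_Reduce_local(in, out, N, MPI_FLOAT, MPI_SUM);

        printf("out[0] = %f (expected 3.0)\n", (double)out[0]);

        free(in_raw);
        free(out_raw);
        MPI_Finalize();
        return 0;
    }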

zhngaj commented 4 years ago

Thanks for the fix. With PR #7957, I did not hit the segfault with IMB-EXT.

rajachan commented 4 years ago

Going to reopen this to track the cherry-pick of https://github.com/open-mpi/ompi/pull/7957 into v4.1.x, which also has the MPI_OP support using vectorized instructions.

jsquyres commented 4 years ago

FYI: #7997 filed to cherry-pick this to v4.1.x branch.