open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

MTT ibm one-sided test failures for ompi v5.0.x #10244

Open shijin-aws opened 2 years ago

shijin-aws commented 2 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v5.0.x branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

./autogen.pl
./configure --prefix=<prefix> CFLAGS=-pipe --enable-picky --enable-debug --enable-mpi1-compatibility
make -j install

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

[ec2-user@ip-172-31-8-95 ompi]$ git submodule status
 d3445c8fb15cfc4a03cfee27593ca1fe1a6d67ab 3rd-party/openpmix (v4.1.2-50-gd3445c8f)
 f3828e8307cf95d67a64eeaa4e36a362ac01e075 3rd-party/prrte (v2.0.2-71-gf3828e8307)

Please describe the system on which you are running


Details of the problem

There are around 40 failures in the ibm test suite for ompi v5.0.x on the tcp path. The full test report can be found in this mtt report.

wzamazon commented 2 years ago

There seem to be multiple issues. The following PR fixed one:

https://github.com/open-mpi/ompi/pull/10462

With this PR, 1sided passes.

wzamazon commented 2 years ago

Another PR:

https://github.com/open-mpi/ompi/pull/10463

This fixed the segfaults in pp_1sided and halo_1sided_put_alloc_mem.

wzamazon commented 2 years ago

The c_accumulate hang with efa turns out to be a bug in the libfabric EFA installer. The fix is in https://github.com/ofiwg/libfabric/pull/7829. It will take a while for mtt to ingest the change.

wzamazon commented 2 years ago

Remaining issues are:

  1. c_put_dynamic_self/c_get_dynamic_self always hang, even for 2 ranks.
  2. When btl/tcp is used, there are segfaults with c_get_accumulate_ddt1 and c_get_accumulate_ddt2.
  3. When btl/tcp is used, c_accumulate is quite slow; not sure whether that is normal.

wzamazon commented 2 years ago

The c_put_dynamic_self/c_get_dynamic_self hang will be fixed by PR https://github.com/open-mpi/ompi/pull/10473

wzamazon commented 2 years ago

Remaining issues:

With btl/ofi, mt_1sided segfaults.

With btl/tcp,

  1. multiple tests (1sided, c_accumulate, etc.) hang.
  2. c_get_accumulate_ddt1 and c_get_accumulate_ddt2 segfault.
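
For anyone trying to reproduce the remaining failures outside of MTT, here is a rough sketch of how a single test could be re-run on 2 ranks with the BTL pinned. The exact MCA parameters used by the MTT run are not shown in this thread, so the combination below is an assumption: `--mca btl` and `--mca osc` are real Open MPI MCA selectors, but forcing `osc rdma` and the `repro_cmd` helper itself are only illustrative. The helper prints the command line rather than running it, and the test binary paths are placeholders for wherever the ibm tests were built.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI or MTT): prints, but does not run,
# an mpirun command line that pins the BTL for a 2-rank one-sided test.
# Assumption: osc/rdma over the selected BTL is the path being exercised.
repro_cmd() {
    btl="$1"        # BTL to select, e.g. tcp or ofi
    test_bin="$2"   # path to an already-built ibm test binary (placeholder)
    printf 'mpirun -np 2 --mca osc rdma --mca btl %s,self %s\n' "$btl" "$test_bin"
}

# btl/tcp segfault case and btl/ofi segfault case from the list above
repro_cmd tcp ./c_get_accumulate_ddt1
repro_cmd ofi ./mt_1sided
```

The printed line can be pasted into a shell on a node where the v5.0.x build is first in `PATH`. Keeping the BTL as a parameter makes it easy to compare the same test under btl/tcp and btl/ofi.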