openpmix / prrte

PMIx Reference RunTime Environment (PRRTE)
https://pmix.org

grpcomm/bmg: pmix_data_compress patch added error PACK-MISMATCH #961

Closed: abouteiller closed this issue 3 years ago

abouteiller commented 3 years ago

Thank you for taking the time to submit an issue!

Background information

When using PRTE with FT (e.g., with Open MPI mpiexec --with-ft ulfm), the error message /grpcomm_bmg_module.c:199] PMIx Error: PACK-MISMATCH is issued and the application is aborted immediately.

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

b1adc909 (HEAD -> fix/bmg, mine/fix/bmg) detector: correct various segfault introduced with the NAMESPACE change Aurelien Bouteiller 14 minutes..
00a065fb PRTE_RETAIN cannot be used on pmix_data_buffer_t                                                        Aurelien Bouteiller 14 minutes..
9d8188f0 (origin/master, origin/HEAD, master) Merge pull request #956 from rhc54/topic/fnc                             Ralph Castain   7 days ago
What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)
1191ca59 (HEAD) Merge pull request #2189 from rhc54/topic/dedup                                                        Ralph Castain   8 days ago

Please describe the system on which you are running


Details of the problem

To replicate the issue, one can use Open MPI together with the ompi-tests-public suite:

# in the Open MPI source tree: fetch submodules and generate the configure script
git submodule update --init --recursive && ./autogen.pl
cd ompi-builddir
${ompi_srcdir}/configure && make install
# fetch the ULFM test suite
git clone https://github.com/open-mpi/ompi-tests-public.git
cd ompi-tests-public
git submodule update --init --recursive
cd ulfm-testing/api
salloc -N 2 ompi/master.debug/bin/mpiexec -N 2 -np 4 --with-ft mpi ./err_returns
### To run ALL FT tests:
# ULFM_PREFIX=${ompi_builddir} runtest.sh
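
Running the final err_returns command above reproduces the abort described at the top of this issue; the job is killed immediately with:

    /grpcomm_bmg_module.c:199] PMIx Error: PACK-MISMATCH
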
abouteiller commented 3 years ago

@rhc54 I believe the observed behavior is a combination of two separate issues:

  1. The propagated RBCAST BMG message is not created/copied with the compression flags. I have tried to address this with 00a065fb, but it appears insufficient (see the sketch after this list). This by itself should not abort the application, but it would prevent error propagation from working when daemon failures are present.
  2. In the test case we have only MPI process failures; the reason we abort the app is:
    [saturn:55357] [mpiexec-saturn-55357@0,0] errmgr:dvm: for proc [mpiexec-saturn-55357@0,2] state COMMUNICATION FAILURE
    [saturn:55357] [mpiexec-saturn-55357@0,0] Comm failure: daemons terminating - recording daemon [mpiexec-saturn-55357@0,2] as gone
    [saturn:55357] [mpiexec-saturn-55357@0,0] Comm failure: 1 routes remain alive
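
For reference on item 1, here is a minimal sketch of the buffer duplication involved, assuming an incoming rbcast buffer (the function and variable names are illustrative, not the actual 00a065fb patch): a pmix_data_buffer_t is not a reference-counted object, so PRTE_RETAIN cannot be used on it; the relayed copy has to be built by copying the payload so that the packed (and possibly compressed) bytes arrive intact.

    #include <pmix.h>

    /* Illustrative sketch only, not the actual patch: clone an incoming
     * rbcast buffer into a fresh pmix_data_buffer_t before relaying it.
     * Copying the payload keeps the packed bytes, including any
     * compression applied to them, intact in the relayed copy. */
    static pmix_data_buffer_t *clone_rbcast_buffer(pmix_data_buffer_t *incoming)
    {
        pmix_data_buffer_t *relay;
        pmix_status_t rc;

        PMIX_DATA_BUFFER_CREATE(relay);
        rc = PMIx_Data_copy_payload(relay, incoming);
        if (PMIX_SUCCESS != rc) {
            PMIX_DATA_BUFFER_RELEASE(relay);
            return NULL;
        }
        return relay;
    }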

The behavior in item 2 probably comes from baec91f3, which removed the FORCED_TERMINATE macro; cleanup is now unconditional, even if enable-recovery has been set.
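
For context, the guard being argued for looks roughly like the following (a hedged sketch with hypothetical names, not the real errmgr/dvm code; the two helpers stand in for whatever the DVM error manager actually does on each path):

    #include <stdbool.h>

    extern void force_terminate_dvm(int errcode);        /* abort the whole DVM (hypothetical) */
    extern void record_daemon_as_gone(int daemon_rank);  /* mark daemon lost, continue (hypothetical) */

    /* Hypothetical sketch of the missing guard from item 2: only force
     * DVM-wide termination when recovery was not requested. */
    static void handle_daemon_loss(bool recovery_enabled, int daemon_rank)
    {
        if (!recovery_enabled) {
            /* recovery not requested: terminate, as the old
             * FORCED_TERMINATE-style path did */
            force_terminate_dvm(1);
        } else {
            /* enable-recovery was set: record the loss and keep the
             * remaining routes, and the job, alive */
            record_daemon_as_gone(daemon_rank);
        }
    }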

abouteiller commented 3 years ago

The issue is now resolved in PR #960.