open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

timeout always 14 seconds when MPIEXEC_TIMEOUT set to any value #12837

Open minrk opened 1 month ago

minrk commented 1 month ago

Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From conda-forge; the build is configured with:

./configure --prefix=$PREFIX \
            --disable-dependency-tracking \
            --disable-wrapper-runpath \
            --enable-mpi-fortran \
            --with-mpi-moduledir='${includedir}' \
            --with-sge \
            --with-hwloc=$PREFIX \
            --with-libevent=$PREFIX \
            --with-zlib=$PREFIX \
            --enable-mca-dso \
            --enable-ipv6

Please describe the system on which you are running

A conda environment can be reproduced in docker:

docker run --rm -it quay.io/condaforge/miniforge3
mamba install openmpi=5
MPIEXEC_TIMEOUT=1 mpiexec -n 2 --allow-run-as-root sleep 20

The same issue appears in debian experimental, also openmpi 5.0.5:

docker run --rm -it debian:experimental
apt update
apt -t experimental install openmpi-bin
MPIEXEC_TIMEOUT=1 mpiexec -n 2 --allow-run-as-root sleep 20

Details of the problem

Strangely, when MPIEXEC_TIMEOUT is set to any value, the reported job timeout is 14 seconds rather than the value given. An explicit --timeout on the command line is still honored:

$ MPIEXEC_TIMEOUT=1 mpiexec -n 2 sleep 20
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

  Timeout: 14 seconds

The job will now be aborted.  Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
$ MPIEXEC_TIMEOUT=1 mpiexec -n 2 --timeout 2 sleep 20
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

  Timeout: 2 seconds

The job will now be aborted.  Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------

This happens on both Arm and x86_64, on both macOS and Linux. Setting the timeout on the command line with mpiexec --timeout=1 works as expected.
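
Since the command-line flag behaves correctly, one possible stopgap is a small wrapper that forwards MPIEXEC_TIMEOUT as an explicit --timeout. This is only a sketch based on the behavior observed above (where --timeout took precedence over the environment variable); the function name mpiexec_t and the MPIEXEC_BIN override are hypothetical and not part of Open MPI.

```shell
# Hypothetical workaround: pass MPIEXEC_TIMEOUT on the command line,
# where (per the output above) the value is honored.
# MPIEXEC_BIN is an illustrative override for the launcher path,
# not an Open MPI variable.
mpiexec_t() {
    if [ -n "${MPIEXEC_TIMEOUT:-}" ]; then
        # Explicit flag first, then the caller's own arguments.
        "${MPIEXEC_BIN:-mpiexec}" --timeout "$MPIEXEC_TIMEOUT" "$@"
    else
        # No timeout requested: call the launcher unchanged.
        "${MPIEXEC_BIN:-mpiexec}" "$@"
    fi
}
```

With this defined, `MPIEXEC_TIMEOUT=1 mpiexec_t -n 2 sleep 20` would run `mpiexec --timeout 1 -n 2 sleep 20`.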

Open MPI 5.0.2 does not have this problem.
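
To compare versions, it may help to measure the wall-clock time until the job is aborted, rather than relying only on the value printed in the message. A minimal sketch follows; elapsed_seconds is a hypothetical helper name, and the resolution is whole seconds because it uses date +%s.

```shell
# Hypothetical helper: run a command and print how many whole seconds it took.
# The command's stdout is sent to stderr so that only the number is captured
# by $(...); its exit status is ignored because a timed-out job exits non-zero.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >&2 || true
    end=$(date +%s)
    echo $((end - start))
}

# Example usage (requires mpiexec on PATH):
#   export MPIEXEC_TIMEOUT=1
#   elapsed_seconds mpiexec -n 2 sleep 20
```

On an affected install the printed number would be expected to be well above the requested 1 second, while on 5.0.2 it should be close to it.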

minrk commented 1 month ago

This appears to be a regression only in the 3.0.x branch of PRRTE: https://github.com/openpmix/prrte/pull/2018