Open herter4171 opened 4 years ago
Just realized that MPICH_PORT_RANGE controls MPICH's behavior rather than mpiexec's. I know this is a bit confusing. There are two different sets of connections. First, the process manager needs to connect to control processes; this connection is needed for coordinating process management. Then, once the processes are launched, each MPI process needs to connect to the others independently for MPI communication. While this sounds redundant, the separation allows a different focus: the process manager interface layer is focused on reliability, while the MPICH library is more focused on performance.
The documentation definitely needs updating/correction.
It may be cleaner to let hydra accept an option -portrange low:high and set the appropriate netmod portrange variables. With libfabric, each provider has its own variables. It'll be a mess trying to explain all these to the user.
Hello,
I've just spent a good chunk of time trying to figure out why my processes in a container context were using random, incorrect ports to message back to rank zero in spite of me specifying MPIEXEC_PORT_RANGE=2123:2127. Running this sample code yields the following.

What I've ultimately found is that if I, instead, use MPICH_PORT_RANGE=2123:2127, the output is what it should be.

I know next to nothing about MPI, but going off of the fact that MPICH_PORT_RANGE succeeds where MPIEXEC_PORT_RANGE fails in this case, it seems incorrect to have the docs say these two environment variables accomplish the same thing.

Feel free to close this whenever. I just wanted to offer some feedback so that, hopefully, nobody else loses time on this distinction.
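For reference, a shell sketch of the workaround described above. The program name ./a.out and the process count are placeholders; the environment variable is the one that worked in this case:

```shell
# Constrain the ports MPICH itself uses for MPI communication sockets.
# (MPIEXEC_PORT_RANGE would only affect the process manager's own ports.)
export MPICH_PORT_RANGE=2123:2127

# Launch as usual; ./a.out stands in for the actual MPI program.
# Guarded so the snippet is harmless on machines without MPICH installed.
command -v mpiexec >/dev/null && mpiexec -n 4 ./a.out
```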