open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

PMIX error running with 4.0.7rc1. #9595

Closed mwheinz closed 3 years ago

mwheinz commented 3 years ago

I'm trying to take 4.0.7rc1 for a spin, but I'm having trouble on one of my machines (and not the other). Unfortunately, the error message gives me no hint as to why the two machines behave differently.

Open MPI is configured to use all-internal libraries:

Configure command line: '--prefix=/usr/mpi/gcc/openmpi-expr' '--with-hwloc=internal' '--with-libevent=internal' '--with-pmix=internal' '--with-psm2' '--with-ofi' '--without-ucx' '--without-openib' '--without-verbs'

Result on the "bad" machine is:

[cn-priv-02:~](N/A)$ /usr/mpi/gcc/openmpi-expr/bin/mpirun --mca pmix_base_verbose 99 --mca mtl ofi --mca osc pt2pt --mca btl ^openib --mca pml ^ucx --map-by :OVERSUBSCRIBE --np 2 --host cn-priv-02,cn-priv-02 /home/mheinz/work/imb-ompi/IMB-MPI1 -include Uniband,Biband -mem 0.4 -time 4000 -npmin 2 -iter 1000 -mem 0.4    
[cn-priv-02:131551] mca: base: components_register: registering framework pmix components
[cn-priv-02:131551] mca: base: components_register: found loaded component flux
[cn-priv-02:131551] mca: base: components_register: component flux register function successful
[cn-priv-02:131551] mca: base: components_register: found loaded component pmix3x
[cn-priv-02:131551] mca: base: components_register: component pmix3x register function successful
[cn-priv-02:131551] mca: base: components_open: opening pmix components
[cn-priv-02:131551] mca: base: components_open: found loaded component flux
[cn-priv-02:131551] mca: base: components_open: found loaded component pmix3x
[cn-priv-02:131551] mca: base: components_open: component pmix3x open function successful
[cn-priv-02:131551] mca:base:select: Auto-selecting pmix components
[cn-priv-02:131551] mca:base:select:( pmix) Querying component [flux]
[cn-priv-02:131551] mca:base:select:( pmix) Querying component [pmix3x]
[cn-priv-02:131551] mca:base:select:( pmix) Query of component [pmix3x] set priority to 5
[cn-priv-02:131551] mca:base:select:( pmix) Selected component [pmix3x]
[cn-priv-02:131551] mca: base: close: unloading component flux
[cn-priv-02:131551] psquash: flex128 init
[cn-priv-02:131551] psquash: native init
[cn-priv-02:131551] psquash: flex128 init
[cn-priv-02:131551] HASH:STORE rank 0 key pmix.srvr.uri
[cn-priv-02:131551] PMIX server errreg_cbfunc - error handler registered status=0, reference=0
[cn-priv-02:131551] HASH:STORE rank 0 key opal.puri
[cn-priv-02:131551] HASH:STORE rank 0 key opal.puri
[cn-priv-02:131551] HASH:STORE rank 0 key opal.puri
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.srv.nspace
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.srv.rank
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.jobid
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.offset
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.nodeid
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.num.nodes
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.univ.size
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.job.size
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.job.napps
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.max.size
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.toposig
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.pmem
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.mapby
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.rankby
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.bindto
[cn-priv-02:131551] HASH:STORE rank 0 key pmix.locstr
[cn-priv-02:131551] HASH:STORE rank 0 key pmix.cpuset
[cn-priv-02:131551] HASH:STORE rank 0 key pmix.grank
[cn-priv-02:131551] HASH:STORE rank 0 key pmix.lrank
[cn-priv-02:131551] HASH:STORE rank 0 key pmix.nrank
[cn-priv-02:131551] HASH:STORE rank 0 key pmix.nodeid
[cn-priv-02:131551] HASH:STORE rank 0 key pmix.hname
[cn-priv-02:131551] HASH:STORE rank 0 key pmix.appnum
[cn-priv-02:131551] HASH:STORE rank 1 key pmix.locstr
[cn-priv-02:131551] HASH:STORE rank 1 key pmix.cpuset
[cn-priv-02:131551] HASH:STORE rank 1 key pmix.grank
[cn-priv-02:131551] HASH:STORE rank 1 key pmix.lrank
[cn-priv-02:131551] HASH:STORE rank 1 key pmix.nrank
[cn-priv-02:131551] HASH:STORE rank 1 key pmix.nodeid
[cn-priv-02:131551] HASH:STORE rank 1 key pmix.hname
[cn-priv-02:131551] HASH:STORE rank 1 key pmix.appnum
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.srvr.tmpdir
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.sing.listnr
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.srv.monitor
[cn-priv-02:131551] HASH:STORE rank 0 key pmix.hname
[cn-priv-02:131551] HASH:STORE rank 1 key pmix.hname
[cn-priv-02:131551] HASH:STORE rank -2 key pmix.nlist
[cn-priv-02:131551] PMIX ERROR: TAKE-NEXT-OPTION in file server/pmix_server.c at line 1450
[cn-priv-02:131551] PMIX ERROR: TAKE-NEXT-OPTION in file server/pmix_server.c at line 1450
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:

Error code: -1366
Error name: (null)
Node: cn-priv-02

when attempting to start process rank 0.
--------------------------------------------------------------------------
2 total processes failed to start
[cn-priv-02:131551] psquash: flex128 finalize
[cn-priv-02:131551] sys call unlink(2) fail
[cn-priv-02:131551] mca: base: close: component pmix3x closed
[cn-priv-02:131551] mca: base: close: unloading component pmix3x

mwheinz commented 3 years ago

Figured it out: there were apparently some old binaries lurking in /usr/mpi/gcc/openmpi-expr on the "bad" machine.
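
For anyone hitting the same symptom, one way to look for leftovers in an install prefix is to compare file modification times. The thread does not say how the old binaries were actually found, so the following is only a minimal sketch under an assumption: a fresh install writes all of its files at roughly the same time, so anything much older than the newest file may come from an earlier install. The function name `find_stale_files` and the `gap_seconds` threshold are hypothetical, chosen for illustration.

```python
from pathlib import Path


def find_stale_files(prefix, gap_seconds=3600.0):
    """Return regular files under `prefix` whose modification time is more
    than `gap_seconds` older than the newest file in the same tree.

    Heuristic only (an assumption, not an Open MPI tool): it presumes a
    fresh install touches every file it owns at about the same time.
    """
    # Collect regular files, skipping symlinks so each file is counted once.
    files = [
        p for p in Path(prefix).rglob("*")
        if p.is_file() and not p.is_symlink()
    ]
    if not files:
        return []

    mtimes = {p: p.stat().st_mtime for p in files}
    newest = max(mtimes.values())

    # Anything lagging the newest file by more than the threshold is suspect.
    return sorted(p for p, m in mtimes.items() if newest - m > gap_seconds)


# Example call, using the prefix from the configure line in this issue:
# for path in find_stale_files("/usr/mpi/gcc/openmpi-expr"):
#     print(path)
```

A non-empty result is only a hint that the prefix mixes files from different installs. Removing the prefix and reinstalling into an empty directory is the more reliable way to rule that out.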