open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org
Other
2.18k stars 861 forks source link

Jobs started with srun using cm/PSM2 fail #12886

Open mcarmesin opened 3 weeks ago

mcarmesin commented 3 weeks ago

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI was build from the released sources (https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.bz2) using gcc 14.2 and the configured options ./configure --prefix=%{_prefix} \ --libdir=%{_libdir} \ --enable-shared \ --disable-heterogeneous \ --enable-prte-prefix-by-default \ --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda,coll-cuda,io-romio341 \ --with-show-load-errors=no \ --with-slurm \ --with-psm2 \ --with-pmix=internal \ --with-ucx=%{builddir}%{_prefix} \ --with-ofi=%{builddir}%{_prefix}

Further we use the manually built libfabric v1.21.0 and ucx v1.16.0.

Please describe the system on which you are running


Details of the problem

On our cluster with slurm 23.11.5, Open MPI 5.0 jobs (here the Open MPI ring example) fail inside a SLURM allocation, when started with srun. The same job runs successfully when started using mpirun. The PML is set to "cm" and MTL to "PSM2" using environment variables. Jobs with PML ucx or ob1 run successfully, but with higher latency.

$ salloc --tasks-per-node=5 -t 0:10:0
salloc: Granted job allocation YYYYY
salloc: Waiting for resource configuration
salloc: Nodes XXXX are ready for job
$ srun ./ring
slurmstepd: error: *** STEP YYYYY ON XXXXX CANCELLED AT 2024-10-28T15:49:29 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[XXXXX:858788] pmix_ptl_base: send_msg: write failed: Connection reset by peer (104) [sd = 32]
[XXXXX:858788] pmix_ptl_base: send_msg: write failed: Connection reset by peer (104) [sd = 35]
[XXXXX:858788] pmix_ptl_base: send_msg: write failed: Broken pipe (32) [sd = 26]
srun: error: XXXXX: tasks 0-4: Killed
srun: Terminating StepId=YYYYY.0
$ mpirun ./ring
Process 0 sending 10 to 1, tag 201 (5 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
$
rhc54 commented 3 weeks ago

I'm not sure srun knows to create/pass the OmniPath security tokens - mpirun certainly does. Have you checked to see? This might just be the fabric refusing to send because it lacks the tokens.

mcarmesin commented 3 weeks ago

Using OpenMPI 4.0 PSM2 and srun are working, however using PMI2 instead of PMIX. As far as I understand there seems to be also some successfull communication on the PMIX level before the crash.

PMIX_MCA_ptl_base_verbose="90" srun  ./ring
[XXXXX:1304208] mca: base: component_find: searching NULL for ptl components
[XXXXX:1304208] mca: base: find_dyn_components: checking NULL for ptl components
[XXXXX:1304208] pmix:mca: base: components_register: registering framework ptl components
[XXXXX:1304208] pmix:mca: base: components_register: found loaded component client
[XXXXX:1304208] pmix:mca: base: components_register: component client has no register or open function
[XXXXX:1304208] pmix:mca: base: components_register: found loaded component server
[XXXXX:1304208] pmix:mca: base: components_register: component server has no register or open function
[XXXXX:1304208] pmix:mca: base: components_register: found loaded component tool
[XXXXX:1304208] pmix:mca: base: components_register: component tool has no register or open function
[XXXXX:1304208] mca: base: components_open: opening ptl components
[XXXXX:1304207] mca: base: component_find: searching NULL for ptl components
[XXXXX:1304207] mca: base: find_dyn_components: checking NULL for ptl components
[XXXXX:1304207] pmix:mca: base: components_register: registering framework ptl components
[XXXXX:1304207] pmix:mca: base: components_register: found loaded component client
[XXXXX:1304207] pmix:mca: base: components_register: component client has no register or open function
[XXXXX:1304207] pmix:mca: base: components_register: found loaded component server
[XXXXX:1304207] pmix:mca: base: components_register: component server has no register or open function
[XXXXX:1304207] pmix:mca: base: components_register: found loaded component tool
[XXXXX:1304207] pmix:mca: base: components_register: component tool has no register or open function
[XXXXX:1304207] mca: base: components_open: opening ptl components
[XXXXX:1304207] mca: base: components_open: found loaded component client
[XXXXX:1304207] mca: base: components_open: found loaded component server
[XXXXX:1304207] mca: base: components_open: found loaded component tool
[XXXXX:1304207] mca:ptl:select: checking available component client
[XXXXX:1304207] mca:ptl:select: checking available component server
[XXXXX:1304207] mca:ptl:select: checking available component tool
[XXXXX:1304207] mca:ptl:select: using component client
[XXXXX:1304207] posting notification recv on tag 0
[XXXXX:1304208] mca: base: components_open: found loaded component client
[XXXXX:1304208] mca: base: components_open: found loaded component server
[XXXXX:1304208] mca: base: components_open: found loaded component tool
[XXXXX:1304208] mca:ptl:select: checking available component client
[XXXXX:1304208] mca:ptl:select: checking available component server
[XXXXX:1304208] mca:ptl:select: checking available component tool
[XXXXX:1304208] mca:ptl:select: using component client
[XXXXX:1304208] posting notification recv on tag 0
[XXXXX:1304207] ptl:tcp: connecting to server
[XXXXX:1304208] ptl:tcp: connecting to server
[XXXXX:1304208] V3 SERVER DETECTED
[XXXXX:1304208] ptl:tcp:client attempt connect to pmix-server:1304198 at tcp4://127.0.0.1:48221
[XXXXX:1304207] V3 SERVER DETECTED
[XXXXX:1304207] ptl:tcp:client attempt connect to pmix-server:1304198 at tcp4://127.0.0.1:48221
[XXXXX:1304207] pmix:base setup connection to tcp4://127.0.0.1:48221
[XXXXX:1304208] pmix:base setup connection to tcp4://127.0.0.1:48221
[XXXXX:1304208] ptl_base_connect: attempting to connect to server
[XXXXX:1304208] pmix_ptl_base_connect: attempting to connect to server on socket 13
[XXXXX:1304207] ptl_base_connect: attempting to connect to server
[XXXXX:1304207] pmix_ptl_base_connect: attempting to connect to server on socket 13
[XXXXX:1304207] pmix:ptl SEND CONNECT ACK
[XXXXX:1304208] pmix:ptl SEND CONNECT ACK
[XXXXX:1304208] send blocking of 80 bytes to socket 13
[XXXXX:1304207] send blocking of 80 bytes to socket 13
[XXXXX:1304207] blocking send complete to socket 13
[XXXXX:1304207] pmix: RECV CONNECT ACK FROM SERVER
[XXXXX:1304207] waiting for blocking recv of 4 bytes
[XXXXX:1304208] blocking send complete to socket 13
[XXXXX:1304208] pmix: RECV CONNECT ACK FROM SERVER
[XXXXX:1304208] waiting for blocking recv of 4 bytes
[XXXXX:1304208] blocking receive complete from remote
[XXXXX:1304208] pmix: RECV CONNECT CONFIRMATION
[XXXXX:1304208] waiting for blocking recv of 4 bytes
[XXXXX:1304208] blocking receive complete from remote
[XXXXX:1304208] tcp_peer_try_connect: Connection across to peer pmix-server:1304198 succeeded
[XXXXX:1304207] blocking receive complete from remote
[XXXXX:1304207] pmix: RECV CONNECT CONFIRMATION
[XXXXX:1304207] waiting for blocking recv of 4 bytes
[XXXXX:1304207] blocking receive complete from remote
[XXXXX:1304207] tcp_peer_try_connect: Connection across to peer pmix-server:1304198 succeeded
[XXXXX:1304208] posting recv on tag 101
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 5
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 101 with NON-NULL msg
[XXXXX:1304207] posting recv on tag 101
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 101
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 5
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 101 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 101
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 101 SIZE 109027
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 109027
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 101 SIZE 109027
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 109027
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 109027 BYTES FOR TAG 101 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 109027 bytes for tag 101 on socket 13
[XXXXX:1304207] checking msg on tag 101 for tag 101
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 101 with 109027 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 109027 BYTES FOR TAG 101 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 109027 bytes for tag 101 on socket 13
[XXXXX:1304208] checking msg on tag 101 for tag 101
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 101 with 109027 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304207] posting recv on tag 102
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 110
[XXXXX:1304208] posting recv on tag 102
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 110
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 102 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 102
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 102 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 102
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 102 SIZE 109089
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 109089
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 102 SIZE 109089
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 109089
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 109089 BYTES FOR TAG 102 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 109089 bytes for tag 102 on socket 13
[XXXXX:1304207] checking msg on tag 102 for tag 102
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 102 with 109089 bytes
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 109089 BYTES FOR TAG 102 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 109089 bytes for tag 102 on socket 13
[XXXXX:1304208] checking msg on tag 102 for tag 102
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 102 with 109089 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304207] posting recv on tag 103
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 33
[XXXXX:1304208] posting recv on tag 103
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 33
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 103 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 103
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 103 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 103
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 103 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 103 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 103 on socket 13
[XXXXX:1304208] checking msg on tag 103 for tag 103
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 103 with 8 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304208] posting recv on tag 104
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 49
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 104 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 104
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 103 SIZE 8
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 103 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 8 bytes for tag 103 on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 104 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 104 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 104 on socket 13
[XXXXX:1304208] checking msg on tag 104 for tag 104
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 104 with 8 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304207] checking msg on tag 103 for tag 103
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 103 with 8 bytes
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304208] posting recv on tag 105
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 41
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 105 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 105
[XXXXX:1304207] posting recv on tag 104
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 49
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 104 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 104
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 105 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 105 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 105 on socket 13
[XXXXX:1304208] checking msg on tag 105 for tag 105
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 105 with 8 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 104 SIZE 8
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 104 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 8 bytes for tag 104 on socket 13
[XXXXX:1304207] checking msg on tag 104 for tag 104
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 104 with 8 bytes
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304207] posting recv on tag 105
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 41
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 105 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 105
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 105 SIZE 8
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 105 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 8 bytes for tag 105 on socket 13
[XXXXX:1304207] checking msg on tag 105 for tag 105
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 105 with 8 bytes
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304207] posting recv on tag 106
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 486
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 106 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 106
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 106 SIZE 8
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 106 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 8 bytes for tag 106 on socket 13
[XXXXX:1304207] checking msg on tag 106 for tag 106
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 106 with 8 bytes
[XXXXX:1304208] posting recv on tag 106
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 486
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 106 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 106
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 106 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 106 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 106 on socket 13
[XXXXX:1304208] checking msg on tag 106 for tag 106
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 106 with 8 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304208] posting recv on tag 107
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 146
[XXXXX:1304207] posting recv on tag 107
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 146
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 107 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 107
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 107 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 107
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 107 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 107 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 107 on socket 13
[XXXXX:1304208] checking msg on tag 107 for tag 107
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 107 with 8 bytes
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 107 SIZE 8
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 107 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 8 bytes for tag 107 on socket 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304207] checking msg on tag 107 for tag 107
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 107 with 8 bytes
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304208] posting recv on tag 108
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 1003
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 108 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 108
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 108 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 108 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 108 on socket 13
[XXXXX:1304208] checking msg on tag 108 for tag 108
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 108 with 8 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304208] posting recv on tag 109
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 95
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 109 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 109
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304207] posting recv on tag 108
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 1075
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 108 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 108
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 108 SIZE 8
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 108 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 8 bytes for tag 108 on socket 13
[XXXXX:1304207] checking msg on tag 108 for tag 108
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 108 with 8 bytes
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304207] posting recv on tag 109
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 95
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 109 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 109
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 109 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 109 SIZE 8
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 109 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 8 bytes for tag 109 on socket 13
[XXXXX:1304207] checking msg on tag 109 for tag 109
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 109 with 8 bytes
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 109 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 109 on socket 13
[XXXXX:1304208] checking msg on tag 109 for tag 109
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 109 with 8 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304207] posting recv on tag 110
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 78
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 110 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 110
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304208] posting recv on tag 110
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 78
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 110 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 110
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 110 SIZE 725
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 725
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 725 BYTES FOR TAG 110 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 725 bytes for tag 110 on socket 13
[XXXXX:1304208] checking msg on tag 110 for tag 110
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 110 with 725 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304208] posting recv on tag 111
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 78
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 111 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 111
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 110 SIZE 8
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 110 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 8 bytes for tag 110 on socket 13
[XXXXX:1304207] checking msg on tag 110 for tag 110
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 110 with 8 bytes
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 111 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 111 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 111 on socket 13
[XXXXX:1304208] checking msg on tag 111 for tag 111
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 111 with 8 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304207] posting recv on tag 111
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 810
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 111 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 111
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 111 SIZE 8
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 111 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 8 bytes for tag 111 on socket 13
[XXXXX:1304207] checking msg on tag 111 for tag 111
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 111 with 8 bytes
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304207] posting recv on tag 112
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 834
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 112 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 112
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 112 SIZE 8
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 112 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 8 bytes for tag 112 on socket 13
[XXXXX:1304207] checking msg on tag 112 for tag 112
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 112 with 8 bytes
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
[XXXXX:1304208] posting recv on tag 112
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 810
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 112 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 112
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304208] posting recv on tag 113
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 834
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 112 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 112 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 112 on socket 13
[XXXXX:1304208] checking msg on tag 112 for tag 113
[XXXXX:1304208] checking msg on tag 112 for tag 112
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 112 with 8 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 113 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 113
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 113 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 113 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 113 on socket 13
[XXXXX:1304208] checking msg on tag 113 for tag 113
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 113 with 8 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304208] posting recv on tag 114
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 551
[XXXXX:1304208] posting recv on tag 115
[XXXXX:1304208] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 35
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 114 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 114
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 115 with NON-NULL msg
[XXXXX:1304208] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 115
[XXXXX:1304208] ptl:base:send_handler MSG SENT
[XXXXX:1304207] posting recv on tag 113
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 551
[XXXXX:1304207] posting recv on tag 114
[XXXXX:1304207] QUEUEING MSG TO SERVER [pmix-server,1304198] ON SOCKET 13 OF SIZE 35
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 113 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 113
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:send_handler SENDING TO PEER [pmix-server,1304198] tag 114 with NON-NULL msg
[XXXXX:1304207] ptl:base:send_handler SENDING MSG TO [pmix-server,1304198] TAG 114
[XXXXX:1304207] ptl:base:send_handler MSG SENT
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 114 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 114 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 114 on socket 13
[XXXXX:1304208] checking msg on tag 114 for tag 115
[XXXXX:1304208] checking msg on tag 114 for tag 114
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 114 with 8 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304208] [slurm.pmix.13957075.0,1] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304208] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304208] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304208] [slurm.pmix.13957075.0,1] RECVD MSG FROM [pmix-server,1304198] FOR TAG 115 SIZE 8
[XXXXX:1304208] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] slurm.pmix.13957075.0:1 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 115 ON PEER SOCKET 13
[XXXXX:1304208] slurm.pmix.13957075.0:1 message received 8 bytes for tag 115 on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] ptl:base:recv:handler called with peer pmix-server:1304198
[XXXXX:1304207] ptl:base:recv:handler allocate new recv msg
[XXXXX:1304207] ptl:base:recv:handler read hdr on socket 13
[XXXXX:1304207] [slurm.pmix.13957075.0,0] RECVD MSG FROM [pmix-server,1304198] FOR TAG 113 SIZE 8
[XXXXX:1304207] ptl:base:recv:handler allocate data region of size 8
[XXXXX:1304208] checking msg on tag 115 for tag 115
[XXXXX:1304208] slurm.pmix.13957075.0:1 EXECUTE CALLBACK for tag 115 with 8 bytes
[XXXXX:1304208] slurm.pmix.13957075.0:1 CALLBACK COMPLETE
[XXXXX:1304207] slurm.pmix.13957075.0:0 RECVD COMPLETE MESSAGE FROM SERVER OF 8 BYTES FOR TAG 113 ON PEER SOCKET 13
[XXXXX:1304207] slurm.pmix.13957075.0:0 message received 8 bytes for tag 113 on socket 13
[XXXXX:1304207] checking msg on tag 113 for tag 114
[XXXXX:1304207] checking msg on tag 113 for tag 113
[XXXXX:1304207] slurm.pmix.13957075.0:0 EXECUTE CALLBACK for tag 113 with 8 bytes
[XXXXX:1304207] slurm.pmix.13957075.0:0 CALLBACK COMPLETE
slurmstepd: error: *** STEP 13957075.0 ON XXXXX CANCELLED AT 2024-10-29T18:43:15 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: XXXXX: task 0: Killed
srun: Terminating StepId=13957075.0
srun: error: XXXXX: task 1: Exited with exit code 14
rhc54 commented 3 weeks ago

PMIx has nothing to do with the fabric, so the test doesn't mean anything per the reported issue. Likewise, the Slurm PMI2 module is independent of the PMIx one, and so the question remains regarding provision of the OmniPath token. You could just srun -n 1 env (ensuring the PMIx module is active) and see if OMPI_MCA_orte_precondition_transports is in the environment - if not, then OmniPath will have a problem.

mcarmesin commented 3 weeks ago

OMPI_MCA_orte_precondition_transports is indeed not present. The strange thing is, that it is also not present with Open MPI 4.0, where the PSM2 MTL seems to be working correctly. Is setting this variable a new requirement for Open MPI 5.0 or PMIX? I couldn't find anything in the OpenMPI documentation.

mcarmesin commented 6 days ago

I managed to get Open MPI 4.1 running with PMIX and PML cm / MTL PSM2, even though OMPI_MCA_orte_precondition_transport is not set. So the problem is most likely on the OpenMPI side and not related to the environment variable.