Open psteinbrecher opened 7 months ago
I see. So it is the IPC path. We need to make IPC GPU copy nonblocking.
Let us know if you need any kind of help.
@jxy Thanks. Could you confirm that the blocking nature of MPI_Test
is observed for both intra-node and inter-node? I'll focus on the intra-node first and ping you for testing when we have a patch.
Yes, both have the issue. Getting intra-node first sounds like a good plan.
@psteinbrecher @jxy This PR https://github.com/pmodels/mpich/pull/6841 should fix the blocking issues of MPI_Test, at least for contiguous datatypes intra-node. Could you test?
Yes, let me try to build and run it!
Overlap is not happening with this change. Let's discuss directly; I will reach out to you.
Tested the reproducer using the main branch on sunspot, running two processes on a single node, with some printf debugging:
```
[1] num_elements = 128000000 (max 256000000)
[0] num_elements = 128000000 (max 256000000)
[0] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[0] . engine=1 MPIR_Ilocalcopy_gpu...
[0] . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 0
[0] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[0] . engine=1 MPIR_Ilocalcopy_gpu...
[0] . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 1
[1] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[1] . engine=1 MPIR_Ilocalcopy_gpu...
[1] . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 0
[1] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[1] . engine=1 MPIR_Ilocalcopy_gpu...
[1] . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 1
[1] . gpu_ipc_async_poll succeeded, async_count = 6258
[1] recv_request completed after MPI_Test 3131 times
[1] . gpu_ipc_async_poll succeeded, async_count = 11184
[0] . gpu_ipc_async_poll succeeded, async_count = 38532
[0] recv_request completed after MPI_Test 19268 times
[1] recv_request2 completed after MPI_Test 1 times
[0] . gpu_ipc_async_poll succeeded, async_count = 64342
[0] recv_request2 completed after MPI_Test 25809 times
[0] local_time = 38.414 ms
[0] ================ rank 0 =====================
[0] STREAM array size = 1024 MB
[0] MPI message size = 512 MB
[0] total time = 38414 usec
[0] MPI bandwidth = 26.6569 GB/s
[1] local_time = 38.404 ms
[1] ================ rank 1 =====================
[1] STREAM array size = 1024 MB
[1] MPI message size = 512 MB
[1] total time = 38404 usec
[1] MPI bandwidth = 26.6639 GB/s
```
I think this shows that MPI_Test is not blocking; otherwise, it would complete after a single call.
Yes, sorry. That was a mistake on my side with an older test I provided here. It works fine now. I will test it with QUDA next. Overlap of communication and compute is seen, as well as multiple communications running in parallel.
The QUDA application relies on the non-blocking behavior of MPI_Test. However, the current MPICH implementation of MPI_Test is blocking: when tracing, I see that MPI_Test only returns once the communication is complete. MPI_Test should just test whether communication is complete, not wait for completion. The QUDA app sends multiple messages for a halo exchange, so this issue serializes all communication within a halo exchange.
QUDA also benefits from MPI_Start actually starting the communication, but the current MPICH implementation does not do this. Whatever change we put in to fix MPI_Test could also be used in MPI_Start, e.g. by calling MPI_Test inside MPI_Start at the end of the function.
Here are other MPI implementations that have the needed behavior: Open MPI, HPCX, Cray MPI, and MVAPICH.
You can test the behavior on Aurora with the attached reproducer t.cpp (t.zip).
Run on 1 Aurora node via:
Then use your favorite tracing tool, e.g. unitrace or iprof, to see whether all comms run in parallel and also overlap with the compute kernel.
Here is a screenshot showing that we do not get overlap or parallel comms, due to the blocking nature of MPI_Test in MPICH.