Open psteinbrecher opened 7 months ago
I see. So it is the IPC path. We need to make IPC GPU copy nonblocking.
Let us know if you need any kind of help.
@jxy Thanks. Could you confirm that the blocking nature of MPI_Test
is observed for both intra-node and inter-node? I'll focus on the intra-node first and ping you for testing when we have a patch.
Yes, both have the issue. Getting intra-node first sounds like a good plan.
@psteinbrecher @jxy This PR https://github.com/pmodels/mpich/pull/6841 should fix the blocking issues of MPI_Test, at least for contiguous datatypes intra-node. Could you test?
Yes, let me try to build and run it!
Overlap is not happening with this change. Let's discuss directly; I will reach out to you.
Tested the reproducer using the main branch on sunspot, running two processes on a single node, with some printf debugging:
```
[1] num_elements = 128000000 (max 256000000)
[0] num_elements = 128000000 (max 256000000)
[0] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[0] . engine=1 MPIR_Ilocalcopy_gpu...
[0] . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 0
[0] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[0] . engine=1 MPIR_Ilocalcopy_gpu...
[0] . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 1
[1] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[1] . engine=1 MPIR_Ilocalcopy_gpu...
[1] . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 0
[1] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[1] . engine=1 MPIR_Ilocalcopy_gpu...
[1] . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 1
[1] . gpu_ipc_async_poll succeeded, async_count = 6258
[1] recv_request completed after MPI_Test 3131 times
[1] . gpu_ipc_async_poll succeeded, async_count = 11184
[0] . gpu_ipc_async_poll succeeded, async_count = 38532
[0] recv_request completed after MPI_Test 19268 times
[1] recv_request2 completed after MPI_Test 1 times
[0] . gpu_ipc_async_poll succeeded, async_count = 64342
[0] recv_request2 completed after MPI_Test 25809 times
[0] local_time = 38.414 ms
[0] ================ rank 0 =====================
[0] STREAM array size = 1024 MB
[0] MPI message size = 512 MB
[0] total time = 38414 usec
[0] MPI bandwidth = 26.6569 GB/s
[1] local_time = 38.404 ms
[1] ================ rank 1 =====================
[1] STREAM array size = 1024 MB
[1] MPI message size = 512 MB
[1] total time = 38404 usec
[1] MPI bandwidth = 26.6639 GB/s
```
I think this shows that MPI_Test is not blocking; otherwise, it would complete after a single call.
Yes, sorry. That was a mistake on my side with an older test I provided here. It works fine now. I will test it with QUDA next. Overlap of communication and compute is seen, as well as multiple communications running in parallel.
The QUDA application relies on the non-blocking behavior of MPI_Test. However, the current MPICH implementation of MPI_Test is blocking: when tracing, I see that MPI_Test only returns once the communication is complete. MPI_Test should just test whether communication is complete, not wait for completion. The QUDA app sends multiple messages for a halo exchange, so this issue serializes all communication within a halo exchange.
QUDA also benefits from MPI_Start actually starting the communication, but the current MPICH implementation does not do this. Whatever change we put in to fix MPI_Test could also be used in MPI_Start, e.g. by calling MPI_Test inside MPI_Start at the end of the function.
Here are other MPI implementations that have the needed behavior: Open MPI, HPCX, Cray MPI, and MVAPICH.
You can test the behavior on Aurora with the attached reproducer t.cpp (t.zip).
Run on 1 Aurora node via:
Then use your favorite tracing tool, e.g. unitrace or iprof, to see whether all comms run in parallel and also overlap with the compute kernel.
Here is a screenshot showing that we do not get overlap or parallel comms, due to the blocking nature of MPI_Test in MPICH.