LukasBreitwieser opened this issue 7 months ago
I don't have a good explanation for what is going on here, but here are a few ideas on how to narrow down the problem:
1) Can you observe the problem if all ranks run on a single node?
2) Can you observe the problem if all processes run on distinct nodes?
3) Do you see similar behavior with one of the standard benchmarks (like OSU p2p)?
Many thanks for your reply, Joseph!
- Can you observe the problem if all ranks run on a single node?
On one node, the results are fine: systemA-1node-72ranks.log
- Can you observe the problem if all processes run on distinct nodes?
I modified the reproducer so that only 2 ranks transfer multiple messages with each other and could not reproduce the issue (see reproducer-2-ranks-multiple-messages.cc.log). The transferred data volume and the number of messages sent between the two nodes are the same as before. Only the messages that were sent to ranks on the same node are absent in this experiment. Here are the results: systemA-2ranks-multiple-messages.log
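For clarity, the core of the modified version looks roughly like the sketch below. This is a simplification, not the attached file; the message count and size are placeholders chosen to match the full run's inter-node traffic.

// Sketch of the 2-rank variant: one rank per node exchanging num_msgs messages of msg_size bytes,
// so that inter-node volume and message count match the full 144-rank run (placeholder values).
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int num_msgs = 936;   // placeholder: 72 ranks/node * 13 remote neighbors in the full run
  const int msg_size = 711;   // placeholder message size in bytes
  const int peer = 1 - rank;  // assumes exactly 2 ranks, one per node

  std::vector<std::vector<char>> sendbuf(num_msgs, std::vector<char>(msg_size));
  std::vector<std::vector<char>> recvbuf(num_msgs, std::vector<char>(msg_size));
  std::vector<MPI_Request> reqs(2 * num_msgs);

  double t0 = MPI_Wtime();
  for (int i = 0; i < num_msgs; ++i) {
    MPI_Irecv(recvbuf[i].data(), msg_size, MPI_CHAR, peer, i, MPI_COMM_WORLD, &reqs[2 * i]);
    MPI_Isend(sendbuf[i].data(), msg_size, MPI_CHAR, peer, i, MPI_COMM_WORLD, &reqs[2 * i + 1]);
  }
  MPI_Waitall(2 * num_msgs, reqs.data(), MPI_STATUSES_IGNORE);
  std::printf("rank %d: %.3f ms\n", rank, 1e3 * (MPI_Wtime() - t0));

  MPI_Finalize();
  return 0;
}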
- Do you see similar behavior with one of the standard benchmarks (like OSU p2p)?
As far as I can tell, the pt2pt OSU benchmark results look ok. I attach the results for runs with different numbers of ranks and nodes (osu-pt2pt-results.zip). Note: only the osu_multi_lat and osu_mbw_mr benchmarks support more than one pair of ranks. MPI_Isend and MPI_Irecv are used only in osu_mbw_mr, osu_bibw, and osu_bw.
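For reference, invocations of roughly the following form exercise the same two-node paths; the binary paths and mapping flags here are illustrative, not the exact commands behind the attached logs:

mpirun -np 2 --map-by node ./osu_latency
mpirun -np 2 --map-by node ./osu_bibw
mpirun -np 144 --map-by ppr:72:node ./osu_mbw_mr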
It seems like the issue is related to the specific communication pattern.
I also ran the original reproducer through hpctoolkit (System A, 2 nodes, 144 ranks). In the timeline, the blue bars show the MPI_Waits; they are concentrated on node 2. I still need an explanation for this asymmetry.
I would also like to see profiling information for MPI itself. However, hpctoolkit only shows one additional thread besides the application threads, and that thread seems to sit in epoll_wait the entire time. I assume there must be an MPI thread that handles the non-blocking transfers, which is missing in the hpctoolkit analysis. Do you have a recommendation on how best to profile MPI internals and view them on a timeline?
I was benchmarking some related isend data recently. I wonder what you see if you run these:
https://github.com/lrbison/lrbison_benchmarks/tree/main
In particular, the documentation is pretty light, but take a look at the output and adjust make_data.sbatch as needed.
To see what you describe, where results are sometimes fast and other times slow, you probably want to get results from at least these two runs:
# isend, irecv for gather pattern (incast), showing time for each message, and executing 100 trials
mpirun -- ./bench_isend -i -n 1 -t 100 -w 100 -f filename_ii.dat -m $MSG_SIZE -s -r
# isend, irecv for scatter pattern (outcast), showing time for each message, and executing 100 trials
mpirun -- ./bench_isend -o -n 1 -t 100 -w 100 -f filename_ii.dat -m $MSG_SIZE -s -r
Hi,
I am working on a distributed 3D agent-based simulation tool and noticed that sometimes non-blocking point-to-point operations take a very long time to complete.
I created a simplified reproducer that demonstrates the behavior (reproducer.cc). Each rank exchanges messages with its neighbors using MPI_Isend and MPI_Irecv calls. In the reproducer, the number of neighbors and the message size are the same for each rank. I measure the maximum time needed to send/receive all messages across all ranks. I performed the benchmarks on 2-node systems with Ethernet and Infiniband links. Half of the neighbors (13) are on the same node and half (13) on the opposite node.
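The core of the reproducer looks roughly like the sketch below. This is a simplification of the attached reproducer.cc: the neighbor selection here is a ring-like placeholder rather than the real 3D neighborhood, and the repetition/warm-up logic is omitted.

// Simplified sketch of the communication pattern: post all Irecv/Isend, then MPI_Waitall,
// and report the maximum elapsed time across all ranks (placeholder values throughout).
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int num_neighbors = 26;   // 13 on the same node, 13 on the other node
  const int message_size = 711;   // bytes; varied per run in the benchmark
  // Placeholder neighbor selection (ring-like); the real reproducer uses a 3D neighborhood.
  std::vector<int> neighbors;
  for (int k = 1; k <= num_neighbors / 2; ++k) {
    neighbors.push_back((rank + k) % size);
    neighbors.push_back((rank - k + size) % size);
  }

  std::vector<std::vector<char>> sendbuf(num_neighbors, std::vector<char>(message_size));
  std::vector<std::vector<char>> recvbuf(num_neighbors, std::vector<char>(message_size));
  std::vector<MPI_Request> reqs(2 * num_neighbors);

  double t0 = MPI_Wtime();
  for (int i = 0; i < num_neighbors; ++i) {
    MPI_Irecv(recvbuf[i].data(), message_size, MPI_CHAR, neighbors[i], 0,
              MPI_COMM_WORLD, &reqs[2 * i]);
    MPI_Isend(sendbuf[i].data(), message_size, MPI_CHAR, neighbors[i], 0,
              MPI_COMM_WORLD, &reqs[2 * i + 1]);
  }
  MPI_Waitall(2 * num_neighbors, reqs.data(), MPI_STATUSES_IGNORE);
  double local = MPI_Wtime() - t0;

  // Reported metric: maximum time across all ranks.
  double max_time = 0.0;
  MPI_Reduce(&local, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  if (rank == 0) std::printf("max time across ranks: %.3f ms\n", 1e3 * max_time);

  MPI_Finalize();
  return 0;
}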
Benchmark systems
System A: 2 nodes, each with 4 Intel E7-8890 v3 processors (72 physical cores per node), 1 Gbps Ethernet link, Alma Linux 9, OpenMPI 5.0.2 installed with spack install openmpi@5.0.2 schedulers=slurm +legacylaunchers
System B (Dutch National Supercomputer Snellius): 2 nodes, each with 2 AMD Genoa 9654 processors (192 physical cores per node), Infiniband ConnectX-7 (200 Gbps within a rack, 100 Gbps outside the rack), RHEL 8.6, OpenMPI 4.1.5 (provided by the HPC center), https://servicedesk.surf.nl/wiki/display/WIKI/Snellius+hardware
Nodes were used exclusively.
Compilation
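A standard MPI wrapper compile is all that is needed for the reproducer; the flags below are illustrative rather than the exact ones used:

mpic++ -O3 -o reproducer reproducer.cc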
Results
Sometimes, I observe very different results for the exact same message size. The total amount of data transferred between the two nodes can be calculated as num_ranks * num_neighbors/2 * message_size, e.g. System A with 711-byte messages: 144 * 13 * 711 bytes = 1330992 bytes = 1.27 MiB. Given the small amount of data, I did not expect to see transfer times of 216 ms. See
systemA_from-512_to-1024_step-+1.log
For System B, e.g. with 65536-byte messages, the total amount of data transferred is 384 * 13 * 65536 bytes = 327155712 bytes = 312 MiB. I did not expect such a high variance, nor transfer times >50 ms given a line rate of 200 Gbps. See
systemB_from-1_to-2097152_step-x2.log
Sometimes, a specific message size is fast 10x in a row. See
systemA_from-512_to-1024_step-+1.log
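For context, a back-of-envelope estimate of the pure serialization time at line rate (ignoring latency, congestion, and protocol overhead), using the two examples above:

// Back-of-envelope wire time at line rate for the two examples above.
#include <cstdio>
int main() {
  const double bytes_A = 144.0 * 13 * 711;    // 1330992 B  (~1.27 MiB), System A
  const double bytes_B = 384.0 * 13 * 65536;  // 327155712 B (~312 MiB), System B
  const double rate_A = 1e9 / 8;              // 1 Gbps Ethernet, in bytes/s
  const double rate_B = 200e9 / 8;            // 200 Gbps link, in bytes/s
  std::printf("System A: %.1f ms\n", 1e3 * bytes_A / rate_A);  // ~10.6 ms vs. observed 216 ms
  std::printf("System B: %.1f ms\n", 1e3 * bytes_B / rate_B);  // ~13.1 ms vs. observed >50 ms
  return 0;
}

Even with all overheads ignored, the observed 216 ms on System A is roughly 20x the pure wire time, and >50 ms on System B is roughly 4x.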
I used Intel Trace Analyzer to get more insights. Below are the histogram and CDF of the message arrival times on System A.
Do you have an idea of what could be causing this behavior and subsequently how to improve the performance? I tried to keep the description short, but I am happy to provide any further information that you may need.
Many thanks in advance for your help!
Best, Lukas
reproducer.log (GH disallows attachments with a .cc ending -> rename reproducer.log to reproducer.cc)
systemB_from-1_to-2097152_step-x2.log
systemA_from-512_to-1024_step-+1.log
systemA_from-1_to-8192_step-x2.log