Algorithm comparison between MVAPICH and SST-Macro

Hi, a question: is there a code path in MVAPICH that uses the same algorithms as SST-Macro? In other words, when both benchmarks are run with the same parameter file (the same hardware information), can SST-Macro produce results similar to MVAPICH's? Take MPI_Allreduce and MPI_Barrier as examples:

Algorithm in SST-Macro:

1. MPI_Barrier: Bruck algorithm
2. MPI_Allreduce: Wilke-Halving (the Wilke algorithm is a variation of the binary blocks algorithm)
   2.1 First, reduce rounds (similar to the recursive-halving algorithm)
   2.2 Second, recv rounds (similar to the Bruck algorithm)
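To make the barrier item concrete, here is a minimal sketch of the textbook dissemination (Bruck-style) barrier schedule. It is not SST-Macro's code: the function names are hypothetical, and it only models who signals whom in each round, assuming rank r signals rank (r + 2^k) mod P in round k.

```python
def dissemination_barrier_schedule(nproc):
    """Return one list of (sender, receiver) pairs per round.

    In round k every rank signals the rank 2**k positions ahead (mod nproc),
    so the barrier needs ceil(log2(nproc)) rounds for any process count.
    """
    rounds = []
    distance = 1
    while distance < nproc:
        rounds.append([(rank, (rank + distance) % nproc) for rank in range(nproc)])
        distance *= 2
    return rounds


def barrier_completes(nproc):
    """Check that every rank has transitively heard from every other rank."""
    known = [{rank} for rank in range(nproc)]
    for pairs in dissemination_barrier_schedule(nproc):
        # Snapshot so that all messages of a round carry pre-round knowledge.
        snapshot = [set(k) for k in known]
        for src, dst in pairs:
            known[dst] |= snapshot[src]
    return all(len(k) == nproc for k in known)


if __name__ == "__main__":
    for round_no, pairs in enumerate(dissemination_barrier_schedule(4)):
        print(round_no, pairs)
```

With 4 ranks (as in the `aprun -n 4` launch) this gives two rounds, and every rank sends one message per round.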

Algorithm in MVAPICH:

1. MPI_Barrier:
   1.1 If mv2_use_osu_collectives is set (the default): pairwise exchange with the recursive doubling algorithm
   1.2 Otherwise: dissemination algorithm (the Bruck algorithm)
2. MPI_Allreduce:
   2.1 If mv2_use_osu_collectives is set (the default): I have not yet analyzed which algorithm is used
   2.2 Otherwise, the choice depends on the message size (short messages: size <= MPIR_CVAR_ALLREDUCE_LONG_MSG_SIZE; long messages: size > MPIR_CVAR_ALLREDUCE_LONG_MSG_SIZE):
       2.2.1 For long messages, Rabenseifner's algorithm is used: first recursive halving, then recursive doubling.
       2.2.2 For short messages, a recursive doubling algorithm is used.
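The MVAPICH choices above can be sketched as follows. This is only an illustration, not MVAPICH source: the function names are hypothetical, the threshold is passed in rather than assumed, only the size test quoted above is modelled, and the partner schedule is restricted to power-of-two process counts. The byte counts are the standard per-rank send volumes for the two allreduce variants.

```python
def recursive_doubling_partners(rank, nproc):
    """Exchange partners of `rank`, one per round (power-of-two nproc only)."""
    if nproc < 1 or nproc & (nproc - 1):
        raise ValueError("this sketch handles power-of-two process counts only")
    partners = []
    mask = 1
    while mask < nproc:
        partners.append(rank ^ mask)  # flip one bit of the rank per round
        mask <<= 1
    return partners


def choose_allreduce(msg_bytes, long_msg_size):
    """Size test from the list above; `long_msg_size` stands in for
    MPIR_CVAR_ALLREDUCE_LONG_MSG_SIZE."""
    return "rabenseifner" if msg_bytes > long_msg_size else "recursive_doubling"


def bytes_sent_per_rank(algorithm, msg_bytes, nproc):
    """Bytes each rank sends over the whole allreduce."""
    rounds = len(recursive_doubling_partners(0, nproc))
    if algorithm == "recursive_doubling":
        # The full vector is exchanged in each of the log2(nproc) rounds.
        return rounds * msg_bytes
    # Rabenseifner: recursive-halving reduce-scatter followed by a
    # recursive-doubling allgather, each moving msg_bytes * (nproc - 1) / nproc.
    return 2 * msg_bytes * (nproc - 1) // nproc


if __name__ == "__main__":
    print(recursive_doubling_partners(0, 4))
    print(bytes_sent_per_rank("recursive_doubling", 1024, 4))
    print(bytes_sent_per_rank("rabenseifner", 1024, 4))
```

Recursive doubling pairs ranks (rank XOR 2^k), while the dissemination schedule uses (rank + 2^k) mod P, so the two barrier variants put different traffic on the network even though both take log2(P) rounds for power-of-two P.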

Comparing the implementations of MPI_Allreduce and MPI_Barrier shows that SST-Macro and MVAPICH do not use the same algorithms by default. I ran the osu_allreduce and osu_barrier benchmarks in both SST-Macro and MVAPICH, and the results are quite different, as shown in the figure below. The configuration is given in parameters.ini, which matches the hardware information.

parameters.ini (all benchmarks use the same one):

node {
  name = simple
  app1 {
    launch_cmd = aprun -n 4 -N 1
    exe=./osu_allreduce_sst
    allocation = node_id
    node_id_allocation_file = andy-node_id_allocation_topo1_4.txt
    mpi {
      max_vshort_msg_size = 16384
      max_eager_msg_size = 16384
      post_header_delay = 0.81us
      post_rdma_delay = 0.13us
      rdma_pin_latency = 0.9us
      rdma_page_delay = 1ns
      eager_cutoff = 524288
      allgather = ring
    }
  }
  proc {
    frequency = 2.6 GHz
    ncores = 8
    parallelism = 16
  }
  memory {
    name = pisces
    total_bandwidth = 12.8GB/s
    latency = 12.5ns
    arbitrator = cut_through
  }
  nic {
    name = pisces
    negligible_size = 0
    injection {
      mtu = 4096
      arbitrator = cut_through
      bandwidth = 100Gb/s
      latency = 300ns
      credits = 64KB
    }
    ejection {
      mtu = 4096
      arbitrator = cut_through
      bandwidth = 100Gb/s
      latency = 300ns
      credits = 64KB
    }
  }
  os {
    compute_scheduler = simple
    stack_size = 128KB
    stack_chunk_size = 2MB
  }
}
switch {
  router {
    name = table
  }
  name = pisces
  arbitrator = cut_through
  mtu = 512
  link {
    bandwidth = 200Gb/s
    latency = 130ns
    credits = 64KB
  }
  xbar {
    bandwidth = 16Tb/s
  }
  logp {
    bandwidth = 200Gb/s
    hop_latency = 116ns
    out_in_latency = 60ns
  }
}
topology {
  name = file
  filename = topology.json
  routing_tables = routing-table.json
}

Measured with a performance KPI, the osu_allreduce and osu_barrier results from SST-Macro are only 60% and 70% similar to those from MVAPICH.
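The KPI itself is not defined in this post. Purely as an assumption for discussion, one possible similarity measure is the mean ratio of the smaller to the larger latency across message sizes; the function name and the formula below are hypothetical and may differ from the KPI actually used.

```python
def similarity(simulated_us, measured_us):
    """Mean of min/max latency ratios; 1.0 means the two curves coincide.

    Hypothetical metric: both inputs are latencies in microseconds, listed
    for the same message sizes in the same order.
    """
    if len(simulated_us) != len(measured_us) or not measured_us:
        raise ValueError("need two non-empty latency lists of equal length")
    ratios = []
    for sim, real in zip(simulated_us, measured_us):
        if sim <= 0 or real <= 0:
            raise ValueError("latencies must be positive")
        ratios.append(min(sim, real) / max(sim, real))
    return sum(ratios) / len(ratios)
```

A symmetric ratio like this penalizes over- and under-prediction equally, which seems reasonable when it is unclear in advance whether the simulator runs fast or slow relative to the real system.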

Hence the question: is there a code path in MVAPICH that uses the same algorithms as SST-Macro, so that, given the same parameter file (the same hardware information), SST-Macro can produce results similar to MVAPICH's?

Thanks a lot.
