open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Performance issue of MPI_Allgatherv compared to Cray MPICH #11765

changliu777 opened this issue 1 year ago

changliu777 commented 1 year ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

openmpi-v5.0.x-202306140342-9260266

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from a source/distribution tarball


Details of the problem

I am trying to understand a performance issue with Open MPI's MPI_Allgatherv and to compare it with the MPICH shipped by Cray. Here is the test code I used:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define DATA_SIZE 40000000

int main(int argc, char** argv) {
    int rank, size;
    int *send_data, *recv_data, *recv_counts, *recv_displs;
    double start_time, end_time, elapsed_time;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Allocate memory for send and receive data
    send_data = (int*)malloc(sizeof(int) * DATA_SIZE);
    recv_data = (int*)malloc(sizeof(int) * DATA_SIZE * size);
    recv_counts = (int*)malloc(sizeof(int) * size);
    recv_displs = (int*)malloc(sizeof(int) * size);

    // Initialize send data with some values
    for (int i = 0; i < DATA_SIZE; i++) {
        send_data[i] = rank + i;
    }

    // Calculate receive counts and displacements
    for (int i = 0; i < size; i++) {
        recv_counts[i] = DATA_SIZE;
        recv_displs[i] = i * DATA_SIZE;
    }

    // Perform MPI_Allgatherv
    start_time = MPI_Wtime();
    fprintf(stderr, "start\n");
    MPI_Allgatherv(send_data, DATA_SIZE, MPI_INT, recv_data, recv_counts, recv_displs, MPI_INT, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    fprintf(stderr, "finish\n");
    end_time = MPI_Wtime();
    elapsed_time = end_time - start_time;

    // Print the result on rank 0
    if (rank == 0) {
        for (int i = size*DATA_SIZE-10; i < size * DATA_SIZE; i++) {
            printf("%d ", recv_data[i]);
        }
        printf("\n");
        printf("Elapsed Time: %lf seconds\n", elapsed_time);
    }

    // Clean up
    free(send_data);
    free(recv_data);
    free(recv_counts);
    free(recv_displs);
    MPI_Finalize();

    return 0;
}

and here are the results of running the Open MPI build on 32 nodes,

mpirun -np 32 -hostfile hostfile --report-bindings --bind-to none --oversubscribe --mca pml ucx ./a.out
...
Elapsed Time: 5.153137 seconds

For comparison, here are the results using Cray MPICH,

srun -n 32 ./a.out
...
Elapsed Time: 0.724109 seconds

So there is a big slowdown on the Open MPI side. This result was obtained using the daily tarball openmpi-v5.0.x-202306140342-9260266. I have also tested the master branch and got a similar result. With the openmpi-v4.0.x branch the performance is 3x slower.

I also found another interesting point. The above test was run with one process per node. If I instead put all 32 processes on a single node, I get the following results for Open MPI,

mpirun -np 32 -hostfile hostfile --report-bindings --bind-to none --oversubscribe --mca pml ucx ./a.out
...
Elapsed Time: 4.941182 seconds

and for Cray MPICH

srun -n 32 ./a.out
...
Elapsed Time: 7.161596 seconds

so the performance is comparable when all data transfer happens within a single node.

ggouaillardet commented 1 year ago

You should rely on proven benchmarks such as IMB from Intel or the OSU micro-benchmark suite from Ohio State University to evaluate MPI performance.

The very first collective on any communicator might require connections to be established, and hence be much slower than the following ones, so you should run at least a few warmup iterations to hide these one-time costs.
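For example, a minimal warmup-and-averaging sketch that reuses the buffers and variables from the reproducer above could replace the single timed call (the warmup and iteration counts are arbitrary choices for illustration, not values the benchmark requires):

// Hypothetical timing harness for the reproducer above; 3 warmup and
// 10 timed iterations are arbitrary values chosen for illustration.
const int warmup = 3, iterations = 10;

for (int it = 0; it < warmup; it++) {
    MPI_Allgatherv(send_data, DATA_SIZE, MPI_INT,
                   recv_data, recv_counts, recv_displs, MPI_INT,
                   MPI_COMM_WORLD);
}
MPI_Barrier(MPI_COMM_WORLD);   // ensure all ranks enter the timed region together

start_time = MPI_Wtime();
for (int it = 0; it < iterations; it++) {
    MPI_Allgatherv(send_data, DATA_SIZE, MPI_INT,
                   recv_data, recv_counts, recv_displs, MPI_INT,
                   MPI_COMM_WORLD);
}
end_time = MPI_Wtime();
elapsed_time = (end_time - start_time) / iterations;   // average time per call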

Since the counts are uniform, the simpler MPI_Allgather() can do the trick here; did you compare MPI_Allgather() vs MPI_Allgatherv() performance?
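For instance, since every rank contributes the same DATA_SIZE elements at contiguous displacements, the same exchange could be expressed with MPI_Allgather, reusing the reproducer's buffers (just a sketch for comparison, not a required change):

// Uniform-count equivalent of the MPI_Allgatherv call in the reproducer.
MPI_Allgather(send_data, DATA_SIZE, MPI_INT,
              recv_data, DATA_SIZE, MPI_INT,
              MPI_COMM_WORLD);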

The default algorithm used here might also not be the fastest, so you could evaluate coll/tuned vs coll/han and try changing the algorithms used by these modules.
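For example, something along these lines (the MCA parameter names should be double-checked with ompi_info on your build, and the algorithm number below is only a placeholder):

# List the algorithm-selection parameters exposed by coll/tuned
ompi_info --param coll tuned --level 9

# Force a specific allgatherv algorithm in coll/tuned
mpirun -np 32 -hostfile hostfile --mca pml ucx \
       --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_allgatherv_algorithm 3 ./a.out

# Or raise the priority of coll/han so it is selected instead
mpirun -np 32 -hostfile hostfile --mca pml ucx \
       --mca coll_han_priority 100 ./a.out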
