open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Performance issue of MPI_Allgatherv compared to Cray MPICH #11765

changliu777 opened this issue 1 year ago

changliu777 commented 1 year ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

openmpi-v5.0.x-202306140342-9260266

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from a source/distribution tarball


Details of the problem

I am trying to understand a performance issue with Open MPI's MPI_Allgatherv and to compare it with the MPICH shipped by Cray. Here is the test code I used:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define DATA_SIZE 40000000

int main(int argc, char** argv) {
    int rank, size;
    int *send_data, *recv_data, *recv_counts, *recv_displs;
    double start_time, end_time, elapsed_time;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Allocate memory for send and receive data
    send_data = (int*)malloc(sizeof(int) * DATA_SIZE);
    recv_data = (int*)malloc(sizeof(int) * DATA_SIZE * size);
    recv_counts = (int*)malloc(sizeof(int) * size);
    recv_displs = (int*)malloc(sizeof(int) * size);

    // Initialize send data with some values
    for (int i = 0; i < DATA_SIZE; i++) {
        send_data[i] = rank + i;
    }

    // Calculate receive counts and displacements
    for (int i = 0; i < size; i++) {
        recv_counts[i] = DATA_SIZE;
        recv_displs[i] = i * DATA_SIZE;
    }

    // Perform MPI_Allgatherv
    start_time = MPI_Wtime();
    fprintf(stderr, "start\n");
    MPI_Allgatherv(send_data, DATA_SIZE, MPI_INT, recv_data, recv_counts, recv_displs, MPI_INT, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    fprintf(stderr, "finish\n");
    end_time = MPI_Wtime();
    elapsed_time = end_time - start_time;

    // Print the result on rank 0
    if (rank == 0) {
        for (int i = size*DATA_SIZE-10; i < size * DATA_SIZE; i++) {
            printf("%d ", recv_data[i]);
        }
        printf("\n");
        printf("Elapsed Time: %lf seconds\n", elapsed_time);
    }

    // Clean up
    free(send_data);
    free(recv_data);
    free(recv_counts);
    free(recv_displs);
    MPI_Finalize();

    return 0;
}

and here are the results of running the Open MPI build on 32 nodes,

mpirun -np 32 -hostfile hostfile --report-bindings --bind-to none --oversubscribe --mca pml ucx ./a.out
...
Elapsed Time: 5.153137 seconds

For comparison, here are the results using Cray MPICH,

srun -n 32 ./a.out
...
Elapsed Time: 0.724109 seconds

So there is a big slowdown on the Open MPI side. This result was obtained using the daily tarball openmpi-v5.0.x-202306140342-9260266. I have also tested the master branch and got a similar result. With the openmpi-v4.0.x branch the performance is 3x slower.

I also found another interesting point. The above test was run with one process per node. If I instead put all 32 processes on a single node, I get the following results for Open MPI,

mpirun -np 32 -hostfile hostfile --report-bindings --bind-to none --oversubscribe --mca pml ucx ./a.out
...
Elapsed Time: 4.941182 seconds

and for Cray MPICH

srun -n 32 ./a.out
...
Elapsed Time: 7.161596 seconds

so the performance is comparable when all data transfer happens within a single node.

ggouaillardet commented 1 year ago

You should rely on proven benchmarks such as IMB from Intel or the OSU micro-benchmark suite from Ohio State University to evaluate MPI performance.

The very first collective on any communicator might require connections to be established, and hence be much slower than the following ones, so you should run at least a few warmup iterations to hide these one-time costs.
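For example, a minimal warmup-and-averaging sketch that reuses the buffers and variables from the reproducer above could replace the single timed call (the warmup and iteration counts are arbitrary choices for illustration, not values the benchmark requires):

// Hypothetical timing harness for the reproducer above; 3 warmup and
// 10 timed iterations are arbitrary values chosen for illustration.
const int warmup = 3, iterations = 10;

for (int it = 0; it < warmup; it++) {
    MPI_Allgatherv(send_data, DATA_SIZE, MPI_INT,
                   recv_data, recv_counts, recv_displs, MPI_INT,
                   MPI_COMM_WORLD);
}
MPI_Barrier(MPI_COMM_WORLD);   // ensure all ranks enter the timed region together

start_time = MPI_Wtime();
for (int it = 0; it < iterations; it++) {
    MPI_Allgatherv(send_data, DATA_SIZE, MPI_INT,
                   recv_data, recv_counts, recv_displs, MPI_INT,
                   MPI_COMM_WORLD);
}
end_time = MPI_Wtime();
elapsed_time = (end_time - start_time) / iterations;   // average time per call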

Since the counts are uniform, the simpler MPI_Allgather() can do the trick here; did you compare MPI_Allgather() vs MPI_Allgatherv() performance?
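For instance, since every rank contributes the same DATA_SIZE elements at contiguous displacements, the same exchange could be expressed with MPI_Allgather, reusing the reproducer's buffers (just a sketch for comparison, not a required change):

// Uniform-count equivalent of the MPI_Allgatherv call in the reproducer.
MPI_Allgather(send_data, DATA_SIZE, MPI_INT,
              recv_data, DATA_SIZE, MPI_INT,
              MPI_COMM_WORLD);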

The default algorithm used here might also not be the fastest, so you could evaluate coll/tuned vs coll/han and try changing the algorithms used by these modules.
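For example, something along these lines (the MCA parameter names should be double-checked with ompi_info on your build, and the algorithm number below is only a placeholder):

# List the algorithm-selection parameters exposed by coll/tuned
ompi_info --param coll tuned --level 9

# Force a specific allgatherv algorithm in coll/tuned
mpirun -np 32 -hostfile hostfile --mca pml ucx \
       --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_allgatherv_algorithm 3 ./a.out

# Or raise the priority of coll/han so it is selected instead
mpirun -np 32 -hostfile hostfile --mca pml ucx \
       --mca coll_han_priority 100 ./a.out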
