open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

MPI_Intercomm_create leaks memory even with symmetric MPI_Comm_free calls #12019

Open · oj-lappi opened this issue 10 months ago

oj-lappi commented 10 months ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

4.1.6 (also verified on 4.0.3)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Debian's libopenmpi-dev package (also verified on Ubuntu)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running


Details of the problem

MPI_Intercomm_create seems to leak memory: calling MPI_Comm_free on an intercommunicator does not free all of the memory that was allocated for it.

Here is a minimal example which you can build to see the issue:

#include <cstdlib>
#include <mpi.h>

int
main(int /*argc*/, char ** /*argv*/)
{
    MPI_Init(nullptr, nullptr);
    int rank = -1;
    int world_size = -1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // The pairing below needs an even number of ranks.
    if (world_size % 2 != 0) {
        exit(1);
    }

    // Pair each even rank with the next odd rank: 0<->1, 2<->3, ...
    int partner = (world_size + ((rank % 2 == 0) ? rank + 1 : rank - 1)) % world_size;

    MPI_Comm comm = MPI_COMM_NULL;

    // Create and immediately free an intercommunicator, over and over.
    // Everything allocated here should be released by MPI_Comm_free.
    constexpr int iterations = 100000;
    for (int i = 0; i < iterations; i++) {
        MPI_Intercomm_create(MPI_COMM_SELF, 0, MPI_COMM_WORLD, partner, i, &comm);
        MPI_Comm_free(&comm);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
}

If you just watch the resident set sizes of two processes executing this, you will see them grow linearly over time, with one process allocating roughly double what the other does.
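
For reference, here is a minimal sketch (my own addition, not part of the report, and Linux-specific) of how that growth can be logged from inside the loop; resident_set_bytes is a hypothetical helper that reads /proc/self/statm, whose second field is the resident set size in pages:

#include <fstream>
#include <unistd.h>

// Hypothetical helper, not part of the reproducer above: returns the
// current resident set size in bytes by reading /proc/self/statm
// (Linux-specific; the second field of statm is the RSS in pages).
static long resident_set_bytes()
{
    long total_pages = 0;
    long rss_pages = 0;
    std::ifstream statm("/proc/self/statm");
    statm >> total_pages >> rss_pages;
    return rss_pages * sysconf(_SC_PAGESIZE);
}

// Example use inside the reproducer's loop:
//     if (i % 10000 == 0)
//         std::printf("rank %d iter %d rss %ld\n", rank, i, resident_set_bytes());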

Compiling this with clang++-16 and -fsanitize=address, then running mpirun -n 2 a.out, gives (among other minor memory leak reports) the following output for iterations = 100000:

... one rank
Direct leak of 6400416 byte(s) in 200013 object(s) allocated from:
    #0 0x55dde7e7afb2 in malloc (/project/build/test/mpi/mpi_intercomm_memleak+0xb9fb2) (BuildId: 46971416e65faae54e6870e4db559c394b3d131d)
    #1 0x7f6af13cbc12  (<unknown module>)

Direct leak of 1300013 byte(s) in 100001 object(s) allocated from:
    #0 0x55dde7e7afb2 in malloc (/project/build/test/mpi/mpi_intercomm_memleak+0xb9fb2) (BuildId: 46971416e65faae54e6870e4db559c394b3d131d)
    #1 0x7f6af4d24017 in __vasprintf_internal libio/vasprintf.c:116:16
    #2 0x9abaa5f78902caff  (<unknown module>)
...

SUMMARY: AddressSanitizer: 8142121 byte(s) leaked in 400523 allocation(s).

... other rank
Direct leak of 3200384 byte(s) in 100012 object(s) allocated from:
    #0 0x5632a7758fb2 in malloc (/project/build/test/mpi/mpi_intercomm_memleak+0xb9fb2) (BuildId: 46971416e65faae54e6870e4db559c394b3d131d)
    #1 0x7f01091cbc12  (<unknown module>)

SUMMARY: AddressSanitizer: 3242072 byte(s) leaked in 100520 allocation(s).

I find the appearance of vasprintf in the stack traces the most surprising part.
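
As an aside, and my own sketch rather than anything taken from the trace: vasprintf() is a GNU extension that malloc()s its output buffer and hands ownership to the caller, so a leak attributed to __vasprintf_internal usually means an asprintf()/vasprintf()-built string was never free()d inside the library. A minimal illustration of that pattern (make_label is a hypothetical helper):

#include <cstdarg>
#include <cstdio>
#include <cstdlib>

// Illustrative only: a formatting helper in the style that shows up as a
// leak through __vasprintf_internal. vasprintf() allocates the result
// with malloc(), so ownership passes to the caller, who must free() it.
static char *make_label(const char *fmt, ...)
{
    char *out = nullptr;
    va_list ap;
    va_start(ap, fmt);
    if (vasprintf(&out, fmt, ap) < 0) {
        out = nullptr;  // formatting/allocation failed
    }
    va_end(ap);
    return out;
}

int main()
{
    char *label = make_label("intercomm %d", 42);
    // Omitting this free() is exactly what AddressSanitizer reports as a
    // direct leak whose stack passes through __vasprintf_internal.
    free(label);
    return 0;
}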

jsquyres commented 10 months ago

Thanks for filing this. I dug into it a bit this weekend and was able to correlate some minor memory leaks back to our main development branch. I ran out of time before finishing my investigation into the v4.1.x branch; it'll take me a little more time to dig into this properly.