openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Out of memory in rc_recv_desc memory pool #9902

Open kevinsala opened 1 month ago

kevinsala commented 1 month ago

The MPI program below runs out of memory because UCX allocates too many descriptors from the rc_recv_desc memory pool. The program performs thousands of iterations, and each iteration passes data along the chain of processes: process 0 sends to process 1, process 1 sends to process 2, and so on. The first process only sends data, the last one only receives, and the rest both send and receive. In each iteration, the data is exchanged in multiple messages (4096 messages of 4096 bytes each).

#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include <cassert>
#include <cstring>

int main(int argc, char **argv)
{
    constexpr size_t NMessages = 4096;
    constexpr size_t NIterations = 100000;
    constexpr size_t MessageSize = 4096;

    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    assert(provided == MPI_THREAD_MULTIPLE);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    char *buffer = (char *) malloc(MessageSize*NMessages*sizeof(char));
    assert(buffer != NULL);

    MPI_Request *requests = (MPI_Request *) malloc(NMessages*sizeof(MPI_Request));
    assert(requests != NULL);

    for (size_t it = 0; it < NIterations; ++it) {
        if (rank > 0) {
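            // Post all receives for this iteration from the previous rank, then wait for them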
            for (size_t msg = 0; msg < NMessages; ++msg)
                MPI_Irecv(&buffer[msg*MessageSize], MessageSize, MPI_CHAR, rank-1, msg, MPI_COMM_WORLD, &requests[msg]);
            MPI_Waitall(NMessages, requests, MPI_STATUSES_IGNORE);
        }

        if (rank < nranks-1) {
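            // Post all sends for this iteration to the next rank, then wait for them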
            for (size_t msg = 0; msg < NMessages; ++msg)
                MPI_Isend(&buffer[msg*MessageSize], MessageSize, MPI_CHAR, rank+1, msg, MPI_COMM_WORLD, &requests[msg]);
            MPI_Waitall(NMessages, requests, MPI_STATUSES_IGNORE);
        }

        fprintf(stderr, "[%d] finishing iteration %zu\n", rank, it);
    }

    free(buffer);
    free(requests);

    MPI_Finalize();
}

The out-of-memory error always occurs in process 1. It seems that process 0 executes iterations significantly ahead of the other processes (I guess because the MPI_Waitall in process 0 only waits for its sends to complete locally and does not synchronize with the matching receives in process 1). For instance, when the application crashes, process 0 has just finished iteration 7152, while the rest are still at iteration 3743:

[...]
0: [0] finishing iteration 7150
0: [0] finishing iteration 7151
2: [2] finishing iteration 3743
3: [3] finishing iteration 3743
1: [1] finishing iteration 3743
0: [0] finishing iteration 7152
1: slurmstepd: error: Detected 1 oom_kill event in StepId=2219792.0. Some of the step tasks have been OOM Killed.
srun: error: gs30r2b46: task 1: Out Of Memory
srun: Terminating StepId=2219792.0
0: slurmstepd: error: *** STEP 2219792.0 ON gs30r2b31 CANCELLED AT 2024-05-25T20:30:12 ***
srun: error: gs30r2b57: task 2: Terminated
srun: error: gs30r2b31: task 0: Terminated
srun: error: gs30r2b59: task 3: Terminated
srun: Force Terminated StepId=2219792.0

I'm attaching a PDF with the heap profile of process 1 obtained with gperftools: memory.pdf. The profile shows the memory consumed in the last moments before the out-of-memory error, when the process is using around 110 GB. Most of that memory is allocated from the ucp_worker_progress call inside MPI_Waitall.

In the last moments of the execution, the UCX debug output of process 1 (with UCX_LOG_LEVEL=debug) is the following:

[...]
1: [1716661676.533211] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b06200018 of 39845864 bytes with 4752 elements
1: [1716661676.549319] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b08800018 of 39845864 bytes with 4752 elements
1: [1716661676.565373] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b0ae00018 of 39845864 bytes with 4752 elements
1: [1716661676.576079] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b0d400018 of 39845864 bytes with 4752 elements
1: [1716661676.587002] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b0fa00018 of 39845864 bytes with 4752 elements
1: [1716661676.597933] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b12000018 of 39845864 bytes with 4752 elements
1: [1716661676.614437] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b14600018 of 39845864 bytes with 4752 elements
1: [1716661676.625927] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b16c00018 of 39845864 bytes with 4752 elements
1: [1716661676.642146] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b19200018 of 39845864 bytes with 4752 elements
1: [1716661676.652987] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b1b800018 of 39845864 bytes with 4752 elements
1: [1716661676.663934] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b1de00018 of 39845864 bytes with 4752 elements
1: [1716661676.680198] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b20400018 of 39845864 bytes with 4752 elements
1: [1716661676.691382] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b22a00018 of 39845864 bytes with 4752 elements
1: [1716661676.702194] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b25000018 of 39845864 bytes with 4752 elements
1: [1716661676.718434] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b27600018 of 39845864 bytes with 4752 elements
1: [1716661676.734567] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b29c00018 of 39845864 bytes with 4752 elements
1: [1716661676.745502] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b2c200018 of 39845864 bytes with 4752 elements
1: [1716661676.761600] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b2e800018 of 39845864 bytes with 4752 elements
1: [1716661676.772744] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b30e00018 of 39845864 bytes with 4752 elements
1: [1716661676.783711] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b33400018 of 39845864 bytes with 4752 elements
1: [1716661676.871384] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b35a00018 of 39845864 bytes with 4752 elements
1: [1716661676.902518] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b38000018 of 39845864 bytes with 4752 elements
1: [1716661676.927712] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b3a600018 of 39845864 bytes with 4752 elements
1: [1716661676.952377] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b3cc00018 of 39845864 bytes with 4752 elements
1: [1716661676.967374] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b3f200018 of 39845864 bytes with 4752 elements
1: [1716661676.978768] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b41800018 of 39845864 bytes with 4752 elements
1: [1716661676.990236] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b43e00018 of 39845864 bytes with 4752 elements
1: [1716661677.013425] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b46400018 of 39845864 bytes with 4752 elements
1: [1716661677.038073] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b48a00018 of 39845864 bytes with 4752 elements
1: [1716661677.063005] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b4b000018 of 39845864 bytes with 4752 elements
1: [1716661677.092805] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b4d600018 of 39845864 bytes with 4752 elements
1: [1716661677.117955] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b4fc00018 of 39845864 bytes with 4752 elements
1: [1716661677.142417] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b52200018 of 39845864 bytes with 4752 elements
1: [1716661677.171971] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b54800018 of 39845864 bytes with 4752 elements
1: [1716661677.196688] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b56e00018 of 39845864 bytes with 4752 elements
1: [1716661677.221304] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b59400018 of 39845864 bytes with 4752 elements
1: [1716661677.245909] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b5ba00018 of 39845864 bytes with 4752 elements
1: [1716661677.270492] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b5e000018 of 39845864 bytes with 4752 elements
1: [1716661677.472116] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b60600018 of 39845864 bytes with 4752 elements
1: [1716661677.629235] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b62c00018 of 39845864 bytes with 4752 elements
1: [1716661677.715225] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b65200018 of 39845864 bytes with 4752 elements
1: [1716661677.782252] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b67800018 of 39845864 bytes with 4752 elements
1: [1716661696.847464] [gs30r2b46:1088654:0]           mpool.c:287  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x1b69e00018 of 39845864 bytes with 4752 elements
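
As a rough back-of-the-envelope check (my own approximation from the numbers above, assuming each unexpected 4096-byte message occupies one rc_recv_desc element):

descriptor size ≈ 39845864 B / 4752 elements ≈ 8.4 KB per buffered message
iteration gap   = 7152 - 3743 = 3409 iterations
backlog         ≈ 3409 iterations * 4096 messages ≈ 14.0 million unexpected messages
memory          ≈ 14.0 million * 8.4 KB ≈ 117 GB

which is in the same ballpark as the ~110 GB reported by the heap profile.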

Environment

The executions use MPICH 4.2.1 over UCX 1.16.0, but I've observed the same error with previous UCX releases and also with OpenMPI 4.1.6 over UCX.

Commands to reproduce

I can reproduce this error running four processes across four nodes:

$ export UCX_NET_DEVICES=mlx5_0:1
$ mpicxx -O3 reproducer.cpp -o reproducer
$ srun -l --cpu_bind=verbose --hint=nomultithread -N 4 -n 4 -c 56 ./reproducer
yosefe commented 1 month ago

@kevinsala can you try forcing synchronization between sender and receiver by using MPI_Issend (instead of MPI_Isend) for the last message in every window (msg == NMessages-1)?
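
For example, something along these lines in the reproducer's send loop (an untested sketch; only the last send of each window becomes synchronous):

        if (rank < nranks-1) {
            for (size_t msg = 0; msg < NMessages; ++msg) {
                if (msg == NMessages-1) {
                    // Synchronous send: cannot complete until the receiver has matched it
                    MPI_Issend(&buffer[msg*MessageSize], MessageSize, MPI_CHAR, rank+1, msg, MPI_COMM_WORLD, &requests[msg]);
                } else {
                    MPI_Isend(&buffer[msg*MessageSize], MessageSize, MPI_CHAR, rank+1, msg, MPI_COMM_WORLD, &requests[msg]);
                }
            }
            MPI_Waitall(NMessages, requests, MPI_STATUSES_IGNORE);
        }

The MPI_Waitall then cannot return until the receiver has matched the last send, so the sender can run at most about one iteration ahead of the receiver.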

kevinsala commented 1 month ago

Using a synchronous send for the last message of every iteration avoids the out-of-memory error. The memory usage stays at around 180 MB per process.

Is there any throttling mechanism inside UCX to avoid this issue (without needing a workaround on the application side)?

We observe this problem in a task-based MPI+OpenMP application that uses MPI_Testsome rather than MPI_Waitall. As soon as a request completes, the corresponding message of the next iteration (same msg tag) can be issued, so messages from different iterations (but with distinct tags) can be in flight simultaneously. Since we do not call MPI_Waitall in each iteration, I believe that using MPI_Issend for a single message would not be sufficient.

yosefe commented 1 month ago

Currently there is no throttling mechanism in UCX for unexpected tags, though it would be a good feature to add. Is it possible to add a blocking MPI_Ssend once in a while to create such synchronization?
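
For example, something like the following in the reproducer (an untested sketch; SyncInterval and SyncTag are arbitrary illustrative choices, not part of the original code):

    // Declared next to the other constants in main: how often to synchronize,
    // and a tag outside the 0..NMessages-1 range used by the data messages
    constexpr size_t SyncInterval = 64;
    constexpr int SyncTag = NMessages;

    // Inside the iteration loop, after the MPI_Waitall of the receives (rank > 0):
    if (rank > 0 && (it + 1) % SyncInterval == 0)
        MPI_Recv(buffer, 0, MPI_CHAR, rank-1, SyncTag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // Inside the iteration loop, after the MPI_Waitall of the sends (rank < nranks-1):
    if (rank < nranks-1 && (it + 1) % SyncInterval == 0)
        MPI_Ssend(buffer, 0, MPI_CHAR, rank+1, SyncTag, MPI_COMM_WORLD);

The blocking MPI_Ssend does not return until the receiver posts the matching MPI_Recv, so a sender can run at most roughly SyncInterval iterations ahead, which bounds the number of unexpected messages UCX has to buffer.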