
Reduce collectives latency by reusing temporary buffers #12646


wenduwan commented 1 week ago

Problem Statement

Open MPI's default collectives commonly use temporary buffers (host memory) of variable sizes to stage data transferred from remote peers into the local destination buffer. Typically, the buffer is dynamically allocated via malloc at the beginning of each collective routine, used as the send/receive buffer in the PML, and freed at the end. As of May 2024, Open MPI does not explicitly cache these temporary buffers for reuse across collective calls.

Temporary buffers induce latency overhead in 2 ways:

  1. Every collective call pays the cost of malloc and free for the temporary buffer.
  2. Because the buffer is freed after every call, the network layer has to pin (register) freshly allocated memory on every call instead of hitting its memory registration cache.

Feature Proposal

We want to reuse the temporary buffer as much as possible in collectives. In the context of EFA, the buffer is pinned inside libfabric when it is used in a PML send(recv) for the first time; subsequent send(recv) operations on the same buffer do not require re-pinning, thanks to the EFA provider's memory registration cache (the same concept exists in UCX). This leads us to a buffer pool approach.

  1. Collectives should opt in and request buffers from the bufpool when, and only when, the buffer is used to send/recv data through the network layer. The requested buffer should be returned to the bufpool afterwards; collectives shall not free the buffer's memory themselves (see the sketch after this list).
  2. The bufpool should offer a configurable range of buffer sizes as multiples of the page size, i.e. buckets, to accommodate the unpredictable data sizes and counts coming from the application.
  3. The bufpool should be able to grow during runtime, when more buffers are needed.
  4. The bufpool should have a maximum capacity (a user configuration?), defined as the total size of all retained buffers. As mentioned above, the solution should keep Open MPI's memory footprint small. When the bufpool approaches or exceeds its capacity, it should release buffers to reduce its size; meanwhile, the application should still be able to request buffers even if that allocates memory beyond the bufpool's capacity.
  5. The bufpool should be thread-safe. It should not only provision different buffers to different threads, but also update attributes including pool size, total buffer size, etc. in a thread-safe manner.
  6. [Stretch Goal] The bufpool should shrink when buffers become cold, i.e. garbage collection, in order to minimize Open MPI's total memory footprint.
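As a rough illustration of item 1, a collective would swap its malloc/free pair for a pair of pool calls along these lines. The coll_bufpool_* names below are hypothetical, not an existing Open MPI interface; they only show the intended contract:

```c
#include <stddef.h>

/* Hypothetical bufpool handle and interface -- illustrative only. */
typedef struct coll_bufpool_t coll_bufpool_t;

/* Round the request up to the enclosing bucket (a multiple of the page
 * size) and hand out a cached buffer, growing the pool if needed. */
void *coll_bufpool_get(coll_bufpool_t *pool, size_t size);

/* Give the buffer back to its bucket; the pool may later release it if
 * the retained total exceeds the configured capacity. */
void coll_bufpool_return(coll_bufpool_t *pool, void *buf);

/* Inside a collective routine, instead of malloc()/free():
 *
 *     void *tmpbuf = coll_bufpool_get(pool, total_dsize);
 *     ...use tmpbuf as the PML send/recv temporary buffer...
 *     coll_bufpool_return(pool, tmpbuf);   (never free()'d by the collective)
 */
```

Because the pool hands back the same few buffers over and over, the EFA/UCX registration cache keeps hitting on them instead of pinning fresh memory for every call.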

This can be implemented on top of:

  1. opal/mca/allocator/bucket, or
  2. opal/mca/mpool
  3. [bare-bone] One or a couple of free lists, each tracking fixed-size buffers, inside the collectives (sketched below)
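For option 3, here is a minimal sketch of what the bare-bone variant could look like, assuming page-sized power-of-two buckets and a single mutex; the capacity cap, shrinking/garbage collection, and page alignment of the payload are omitted for brevity, and none of these names exist in the tree today:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 8        /* 4 KiB, 8 KiB, ..., 512 KiB */
#define MIN_BUCKET  4096     /* smallest bucket: one page */

typedef struct buf_elem {
    struct buf_elem *next;   /* free-list link */
    int bucket;              /* index of the owning bucket */
    /* payload follows this header */
} buf_elem_t;

typedef struct {
    pthread_mutex_t lock;
    buf_elem_t *free_list[NUM_BUCKETS];
} bufpool_t;

static size_t bucket_size(int idx) { return (size_t)MIN_BUCKET << idx; }

/* Map a request to the smallest bucket that can hold it. */
static int bucket_index(size_t size)
{
    for (int i = 0; i < NUM_BUCKETS; ++i) {
        if (size <= bucket_size(i)) return i;
    }
    return -1;               /* oversized: caller falls back to malloc/free */
}

void bufpool_init(bufpool_t *pool)
{
    pthread_mutex_init(&pool->lock, NULL);
    memset(pool->free_list, 0, sizeof(pool->free_list));
}

void *bufpool_get(bufpool_t *pool, size_t size)
{
    int idx = bucket_index(size);
    if (idx < 0) return NULL;            /* let the caller malloc directly */

    pthread_mutex_lock(&pool->lock);
    buf_elem_t *elem = pool->free_list[idx];
    if (NULL != elem) pool->free_list[idx] = elem->next;
    pthread_mutex_unlock(&pool->lock);

    if (NULL == elem) {                  /* grow the pool on demand */
        elem = malloc(sizeof(*elem) + bucket_size(idx));
        if (NULL == elem) return NULL;
        elem->bucket = idx;
    }
    return (void *)(elem + 1);           /* hand out the payload area */
}

void bufpool_return(bufpool_t *pool, void *buf)
{
    buf_elem_t *elem = (buf_elem_t *)buf - 1;
    pthread_mutex_lock(&pool->lock);
    elem->next = pool->free_list[elem->bucket];
    pool->free_list[elem->bucket] = elem;
    pthread_mutex_unlock(&pool->lock);
}
```

A production version would add the capacity accounting, statistics, and (stretch goal) cold-buffer reclamation from the list above, or simply build on opal/mca/allocator/bucket or opal/mca/mpool as proposed.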

Thinking Forward

After discussions with Brian, we see an improvement opportunity inside the collective algorithms themselves. Currently most collectives send(recv) all application data (payload) in a single shot, regardless of the network bandwidth. If the communication requires a temporary buffer, the buffer size is proportional to the payload size, which can be very large, e.g. GBs. Assuming a single-trip network latency of 1 us and a bandwidth of 100 Gbps, each trip can only carry 100 Gbps * 1 us / 8 = 12.5 KB, so allocating GBs of memory is wasteful: the network can only touch a small fraction of it at any given point in time.

Following this logic, we can infer that for a given hardware platform there is an upper bound on the send(recv) buffer size needed to achieve maximal throughput, regardless of the payload size. This can be implemented by segmenting the data per send(recv), i.e. a pipeline. With this approach we no longer need to worry about very large temporary buffers.
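To make the pipelining idea concrete, here is a minimal sketch under the assumptions above (100 Gbps, 1 us): the temporary buffer is capped near the bandwidth-delay product and the payload is streamed through it in segments. coll_send_segment() is a hypothetical placeholder for whatever PML-level send the collective uses, and a real implementation would post it non-blocking to overlap packing of segment i+1 with the transfer of segment i:

```c
#include <stddef.h>
#include <string.h>

/* Bandwidth-delay product: 100 Gbps * 1 us / 8 = 12.5 KB; round up to a
 * few pages so it matches a bufpool bucket. */
#define SEGMENT_SIZE (16 * 1024)

/* Hypothetical per-segment send -- stands in for the collective's PML call. */
void coll_send_segment(const void *buf, size_t len, int peer);

void pipelined_send(const char *payload, size_t total_size, int peer,
                    void *tmpbuf /* SEGMENT_SIZE bytes from the bufpool */)
{
    for (size_t offset = 0; offset < total_size; offset += SEGMENT_SIZE) {
        size_t len = total_size - offset;
        if (len > SEGMENT_SIZE) len = SEGMENT_SIZE;

        /* Stage only the current segment in the small, reusable buffer... */
        memcpy(tmpbuf, payload + offset, len);

        /* ...and send it; a real pipeline would overlap this with packing
         * the next segment. */
        coll_send_segment(tmpbuf, len, peer);
    }
}
```

With the segment size bounded this way, every temporary buffer fits in a small bufpool bucket no matter how large the application's payload is.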