
Reduce collectives latency by reusing temporary buffers #12646


wenduwan commented 1 week ago

Problem Statement

Open MPI's default collectives commonly use temporary buffers (host memory) of variable sizes to stage data transferred from remote peers into the local destination buffer. Typically, the buffer is dynamically allocated via malloc at the beginning of each collective routine, used as the send/receive buffer in the PML, and freed at the end. As of May 2024, Open MPI does not explicitly cache these temporary buffers for reuse across collective calls.

Temporary buffers induce latency overhead in 2 ways:

  1. Every collective call pays the cost of malloc and free for the temporary buffer.
  2. Because the buffer is freed after every call, the network layer has to pin (register) freshly allocated memory on every call instead of hitting its memory registration cache.

Feature Proposal

We want to reuse the temporary buffer as much as possible in collectives. In the context of EFA, the buffer is pinned inside libfabric when it is used in a PML send(recv) for the first time; subsequent send(recv) operations on the same buffer do not require re-pinning, thanks to the EFA provider's memory registration cache (the same concept exists in UCX). This leads us to a buffer pool approach.

  1. Collectives should opt in and request buffers from the bufpool when, and only when, the buffer is used to send/recv data through the network layer. The requested buffer should be returned to the bufpool afterwards; collectives shall not free the buffer's memory themselves (see the sketch after this list).
  2. The bufpool should offer a configurable range of buffer sizes as multiples of the page size, i.e. buckets, to accommodate the unpredictable data sizes and counts coming from the application.
  3. The bufpool should be able to grow during runtime, when more buffers are needed.
  4. The bufpool should have a maximum capacity (a user configuration?), defined as the total size of all retained buffers. As mentioned above, the solution should keep Open MPI's memory footprint small. When the bufpool approaches or exceeds its capacity, it should release buffers to reduce its size; meanwhile, the application should still be able to request buffers even if that allocates memory beyond the bufpool's capacity.
  5. The bufpool should be thread-safe. It should not only provision different buffers to different threads, but also update attributes including pool size, total buffer size, etc. in a thread-safe manner.
  6. [Stretch Goal] The bufpool should shrink when buffers become cold, i.e. garbage collection, in order to minimize Open MPI's total memory footprint.
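As a rough illustration of item 1, a collective would swap its malloc/free pair for a pair of pool calls along these lines. The coll_bufpool_* names below are hypothetical, not an existing Open MPI interface; they only show the intended contract:

```c
#include <stddef.h>

/* Hypothetical bufpool handle and interface -- illustrative only. */
typedef struct coll_bufpool_t coll_bufpool_t;

/* Round the request up to the enclosing bucket (a multiple of the page
 * size) and hand out a cached buffer, growing the pool if needed. */
void *coll_bufpool_get(coll_bufpool_t *pool, size_t size);

/* Give the buffer back to its bucket; the pool may later release it if
 * the retained total exceeds the configured capacity. */
void coll_bufpool_return(coll_bufpool_t *pool, void *buf);

/* Inside a collective routine, instead of malloc()/free():
 *
 *     void *tmpbuf = coll_bufpool_get(pool, total_dsize);
 *     ...use tmpbuf as the PML send/recv temporary buffer...
 *     coll_bufpool_return(pool, tmpbuf);   (never free()'d by the collective)
 */
```

Because the pool hands back the same few buffers over and over, the EFA/UCX registration cache keeps hitting on them instead of pinning fresh memory for every call.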

This can be implemented on top of:

  1. opal/mca/allocator/bucket, or
  2. opal/mca/mpool
  3. [bare-bone] One or a couple of free lists, each tracking fixed-size buffers, inside the collectives (sketched below)
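For option 3, here is a minimal sketch of what the bare-bone variant could look like, assuming page-sized power-of-two buckets and a single mutex; the capacity cap, shrinking/garbage collection, and page alignment of the payload are omitted for brevity, and none of these names exist in the tree today:

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 8        /* 4 KiB, 8 KiB, ..., 512 KiB */
#define MIN_BUCKET  4096     /* smallest bucket: one page */

typedef struct buf_elem {
    struct buf_elem *next;   /* free-list link */
    int bucket;              /* index of the owning bucket */
    /* payload follows this header */
} buf_elem_t;

typedef struct {
    pthread_mutex_t lock;
    buf_elem_t *free_list[NUM_BUCKETS];
} bufpool_t;

static size_t bucket_size(int idx) { return (size_t)MIN_BUCKET << idx; }

/* Map a request to the smallest bucket that can hold it. */
static int bucket_index(size_t size)
{
    for (int i = 0; i < NUM_BUCKETS; ++i) {
        if (size <= bucket_size(i)) return i;
    }
    return -1;               /* oversized: caller falls back to malloc/free */
}

void bufpool_init(bufpool_t *pool)
{
    pthread_mutex_init(&pool->lock, NULL);
    memset(pool->free_list, 0, sizeof(pool->free_list));
}

void *bufpool_get(bufpool_t *pool, size_t size)
{
    int idx = bucket_index(size);
    if (idx < 0) return NULL;            /* let the caller malloc directly */

    pthread_mutex_lock(&pool->lock);
    buf_elem_t *elem = pool->free_list[idx];
    if (NULL != elem) pool->free_list[idx] = elem->next;
    pthread_mutex_unlock(&pool->lock);

    if (NULL == elem) {                  /* grow the pool on demand */
        elem = malloc(sizeof(*elem) + bucket_size(idx));
        if (NULL == elem) return NULL;
        elem->bucket = idx;
    }
    return (void *)(elem + 1);           /* hand out the payload area */
}

void bufpool_return(bufpool_t *pool, void *buf)
{
    buf_elem_t *elem = (buf_elem_t *)buf - 1;
    pthread_mutex_lock(&pool->lock);
    elem->next = pool->free_list[elem->bucket];
    pool->free_list[elem->bucket] = elem;
    pthread_mutex_unlock(&pool->lock);
}
```

A production version would add the capacity accounting, statistics, and (stretch goal) cold-buffer reclamation from the list above, or simply build on opal/mca/allocator/bucket or opal/mca/mpool as proposed.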

Thinking Forward

After discussions with Brian, we see an improvement opportunity inside the collective algorithms themselves. Currently most collectives send(recv) all application data (payload) in a single shot, regardless of the network bandwidth. If the communication requires a temporary buffer, the buffer size is proportional to the payload size, which can be very large, e.g. GBs. Assuming a single-trip network latency of 1 us and a bandwidth of 100 Gbps, each trip can only carry 100 Gbps * 1 us / 8 = 12.5 KB, so allocating GBs of memory is wasteful: the network can only touch a small fraction of it at any given point in time.

Following this logic, we can infer that for a given hardware platform there is an upper bound on the send(recv) buffer size needed to achieve maximal throughput, regardless of the payload size. This can be implemented by segmenting the data per send(recv), i.e. a pipeline. With this approach we no longer need to worry about very large temporary buffers.
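To make the pipelining idea concrete, here is a minimal sketch under the assumptions above (100 Gbps, 1 us): the temporary buffer is capped near the bandwidth-delay product and the payload is streamed through it in segments. coll_send_segment() is a hypothetical placeholder for whatever PML-level send the collective uses, and a real implementation would post it non-blocking to overlap packing of segment i+1 with the transfer of segment i:

```c
#include <stddef.h>
#include <string.h>

/* Bandwidth-delay product: 100 Gbps * 1 us / 8 = 12.5 KB; round up to a
 * few pages so it matches a bufpool bucket. */
#define SEGMENT_SIZE (16 * 1024)

/* Hypothetical per-segment send -- stands in for the collective's PML call. */
void coll_send_segment(const void *buf, size_t len, int peer);

void pipelined_send(const char *payload, size_t total_size, int peer,
                    void *tmpbuf /* SEGMENT_SIZE bytes from the bufpool */)
{
    for (size_t offset = 0; offset < total_size; offset += SEGMENT_SIZE) {
        size_t len = total_size - offset;
        if (len > SEGMENT_SIZE) len = SEGMENT_SIZE;

        /* Stage only the current segment in the small, reusable buffer... */
        memcpy(tmpbuf, payload + offset, len);

        /* ...and send it; a real pipeline would overlap this with packing
         * the next segment. */
        coll_send_segment(tmpbuf, len, peer);
    }
}
```

With the segment size bounded this way, every temporary buffer fits in a small bufpool bucket no matter how large the application's payload is.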