openucx / ucc

Unified Collective Communication Library
https://openucx.github.io/ucc/
BSD 3-Clause "New" or "Revised" License
195 stars 96 forks source link

Correct cleanup of internal resources associated with communicators #925

Open bosilca opened 7 months ago

bosilca commented 7 months ago

This issue is not entirely UCC related, but relates to UCC integration into an MPI library (in this case Open MPI).

Trigger: If an application calls MPI_Finalize without properly freeing all communicators there is no way to properly release UCC resources.

The root cause is the way UCC tracks OMPI communicators destruction, not by refcounting the collective referenced by the communicator but by adding an attribute (COMM_ATTR) on each communicator and expecting the attribute to be properly deleted (which will call into ucc_comm_attr_del_fn). This is the case when the user free a communicator via MPI_Comm_free, but the MPI standard does not required the release of attributes for non freed communicators in MPI_Finalize (at least not for attributes not associated with MPI_COMM_WORLD). The outcome is a disconnect between OMPI (where all communicators are released) and UCC where they are not.

A proper solution could be complicated to implement. It would require to refcount all the collectives used by OMPI, or to track all communicators internally and release them when the attribute deletion function (ucc_comm_attr_del_fn) is called on MPI_COMM_WORLD.

Sergei-Lebedev commented 4 months ago

@bosilca can we close this issue since https://github.com/open-mpi/ompi/pull/12429 was merged?

bosilca commented 4 months ago

Technically yes, but this is not completely over yet for UCC and HCOL. In open-mpi/ompi#12429 I have not removed the attribute on MPI_COMM_WORLD that both libraries are using to detect MPI_Finalize. I don't think it is needed anymore, but such a change must come from y'all.

Sergei-Lebedev commented 4 months ago

I see, let's keep this issue open then, thanks for update