opal/cuda: Handle stream-ordered allocations and assign primary device context

Akshay-Venkatesh commented 1 month ago

This PR is similar to https://github.com/open-mpi/ompi/pull/12757 and it detects if a given pointer belongs to CUDA memory pools.

Additionally, we assign primary device context to calling thread in the absence of a device context as VMM and memory pool pointers do not have a device context associated with them by design. While ideally we release references on the primary context by making an equal number of cuDevicePrimaryCtxRelease calls as cuDevicePrimaryCtxRetain calls, unfortunately there is no good place to make the release call as we don't know up front when the last reference to the given pointer will be in process lifetime (especially as we don't intercept user CUDA calls to free said memory). For this reason, there will be at most one unreleased reference against the primary device context per user process thread and this shouldn't have any undesired effects on GPU resources as such.

Akshay-Venkatesh commented 1 month ago

Have you ever checked that the mpool are supported in the setup (using the CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED attribute for cuDeviceGetAttribute) ?

Hi @bosilca I have not. Are you suggesting we check the attribute and always return 0 for cuda_check_mpool if mpools are unsupported? If so, DeviceAttribute queries are more expensive and I would say that Pointer Attribute query we have effectively does the same but with lesser overhead. But if we assume homogeneity of GPUs then I agree that your suggestion would eliminate even the pointer query if we have a static variable initialized to reflect mpool support. Question then is if we can assume homogeneity. What are your thoughts on this?

bosilca commented 1 month ago

We can determine all that upfront. Check for mpool support once for each visible devices, generate a warning if they are different and activate the most generic checks for the pointers. If homogeneous, we can remove some of the checks.

I wonder how many heterogeneous setups are really used for anything else than toys ?

Akshay-Venkatesh commented 1 month ago

I wonder how many heterogeneous setups are really used for anything else than toys ?

Indeed this will not be common.

open-mpi / ompi

opal/cuda: Handle stream-ordered allocations and assign primary device context #12835