[BUG] Potential for use-after-free on libcudf/cudf interface boundaries #14229

Open wence- opened 10 months ago

wence- commented 10 months ago

Describe the bug

This comes out of a review of #14133, which introduces a new Scalar type on the Python side in pylibcudf, but I think the issues are pervasive.

Background

libcudf uses RMM for all memory allocations; allocation goes through a memory_resource object. Every allocating libcudf function takes an explicit memory_resource argument (as a raw pointer) that defaults to rmm::get_current_device_resource() (the resource set by rmm::set_per_device_resource). RMM device_buffers hold a raw pointer to their allocating memory resource (needed for deallocation), and it is the user's (in this case libcudf's) responsibility to keep that memory_resource alive until the device_buffer has been destroyed:

This is fine:

memory_resource mr = ...;
{
    device_buffer buf{..., &mr};
    ... // do things with buf
}   // buf's destructor runs here; mr is still alive, fine.

This is not fine (not a real example, since I think the device_buffer constructors don't allow exactly this, but bear with me):

device_buffer buf;
{
    memory_resource mr = ...;
    buf = {..., &mr};
    ... // do things with buf
}   // mr's destructor runs here, boom:
// buf is no longer valid; its allocating memory resource has been destroyed.

How does get_current_device_resource work?

set_per_device_resource stores a raw pointer to a memory resource in a std::unordered_map, and get_current_device_resource just looks that pointer up in the map. Again, the user is responsible for keeping the mr alive.

How does this work in cudf (python) land?

The Cython wrappers in RMM expose memory resources and set/get_current_device_resource. set_per_device_resource sets the value in both the C++-level std::unordered_map and a Python-level dict (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/memory_resource.pyx#L1041-L1049). get_current_device_resource looks up in the Python dict (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/memory_resource.pyx#L1013-1026). DeviceBuffers keep the "current" memory resource alive by storing a reference (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/device_buffer.pyx#L93-L93), but when taking ownership of a C++-level device buffer (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/device_buffer.pyx#L161-L171) we don't use the mr stored in the C++ object.

The Python-level dict is "necessary" because of the way people typically use these APIs from Python: they expect that set_per_device_resource(device, CudaMemoryResource()) keeps the created memory resource alive.
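
For example, here is a minimal sketch using RMM's Python API (device 0 is assumed): the temporary CudaMemoryResource below would be garbage-collected immediately if the Python-level dict did not hold a reference to it.

import rmm

# The Python-level dict keeps the newly created resource alive.
rmm.mr.set_per_device_resource(0, rmm.mr.CudaMemoryResource())
mr = rmm.mr.get_per_device_resource(0)  # still valid: the dict holds a reference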

If a C++ library calls set_per_device_resource at some point, C++-land (libcudf) and Python-land (cudf) can end up disagreeing about what the current memory resource is (because Python-level get_current_device_resource doesn't look at the C++-level map).

So what's the problem?

If libcudf ever allocates memory that it hands off to cudf with a memory resource that is not managed by cudf, we have the potential for use-after-free, because cudf has no way of taking (or sharing) ownership of that memory resource from the RMM buffer.

AFAICT, cudf never explicitly passes a memory resource into libcudf, so the allocation behaviour always relies on C++ and Python agreeing about what get_current_device_resource returns, and on cudf being the creator of that resource. Effectively the pattern is:

# In python
...
1. mr = rmm.get_current_device_resource()
# No mr passed here, so C++-level get_current_device_resource() is used
2. new_column = libcudf.some_algorithm(existing_data)
# take ownership of the data, fingers crossed that new_column.data.mr is mr
3. cudf_column = Column.from_unique_ptr(new_column, mr)

If someone in C++ has set a different per-device resource, or the current thread is pre-empted between lines 1 and 2 (and the current device resource is changed), then line 3 will take "ownership" of the data but will not keep the correct memory resource for the data alive.
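
To make the hazard concrete, here is a hedged sketch of that interleaving (the C++ call in the comment and the names mr_A/mr_B are hypothetical; the other names follow the example above):

mr = rmm.get_current_device_resource()                # (1) Python-level view: mr_A
# ... meanwhile C++ code, or another thread, installs a different resource,
# e.g. rmm::mr::set_current_device_resource(&mr_B)    # C++ map only; the Python dict is unchanged
new_column = libcudf.some_algorithm(existing_data)    # (2) allocated with mr_B
cudf_column = Column.from_unique_ptr(new_column, mr)  # (3) keeps mr_A alive, not mr_B
# If mr_B is later destroyed, cudf_column's data dangles: use-after-free.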

How can we fix this?

If the Cython layer in cudf always explicitly passes a memory resource to libcudf algorithms, this problem goes away: we can then guarantee that the memory resource we hold on the Python side is the one used to allocate the data in libcudf, and so we can keep it alive.
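
As a sketch of that pattern (the explicit mr argument to libcudf.some_algorithm is hypothetical, reusing the names from the example above):

mr = rmm.get_current_device_resource()
new_column = libcudf.some_algorithm(existing_data, mr)  # allocate with exactly this mr
cudf_column = Column.from_unique_ptr(new_column, mr)    # mr is known to own the allocation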

Alternatively, if the memory_resource pointers in RMM were smart pointers, it might be possible to keep things alive that way. Right now we can't make Python and C++ always agree on what the current default resource is, because the C++ side of RMM stores a raw (non-owning) pointer rather than a smart pointer.

wence- commented 10 months ago

After some (long) discussions, my current position is:

  1. cudf Python should explicitly pass memory resources to all calls to libcudf that require them.
  2. It was a mistake to default mr = rmm::mr::get_current_device_resource() in every allocating libcudf function, since in the presence of a third-party library that also uses RMM we have no control over the provenance of this pointer and therefore cannot guarantee the lifetime of the memory resource it points to.