wence- opened this issue 1 month ago
Side thought: Maybe we should experiment with replacing the cuda_memory_resource used for initial_resource with a cuda_async_memory_resource
...
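For concreteness, a minimal sketch of what opting into the async resource looks like from user code today (assuming the per-device resource API; the proposal would effectively make something like this the default behind initial_resource):

```c++
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main() {
  // Owning, cudaMallocAsync-backed resource. Today this is opt-in;
  // the suggestion is to make something like it the default.
  rmm::mr::cuda_async_memory_resource async_mr;
  rmm::mr::set_current_device_resource(&async_mr);

  // Subsequent allocations via the current device resource now go
  // through cudaMallocAsync/cudaFreeAsync on async_mr's pool.
  return 0;
}
```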
Maybe, though we'd have the usual static destruction problems, so we'd never explicitly free that memory pool.
It might also be problematic in the multiple-library case: if one library is not configured with a specific pool and makes an allocation from the initial_resource, that now builds a pool, and we're then (potentially) conflicting with other libraries that set up their own pools.
I was thinking by default the async resource uses the default pool, which we would not own. Maybe I'm misremembering how it's implemented.
The async resource will use the pool managed by the CUDA driver, which we do not own and would probably be fine. Ideally everyone would use that and then all pooling would be handled by the driver. If we use the async mr by default and a different library does not but constructs their own pool manually using a different underlying allocation routine (e.g. cudaMalloc instead of cudaMallocAsync), then we could conflict.
In cuda_async_memory_resource we call cudaMemPoolCreate and get a handle to a pool and use that to make our allocations. So that sounds like we own that pool.
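Roughly, that ownership looks like the following at the CUDA runtime level (a simplified sketch, not RMM's actual code): we create a pool, allocate from it, and are then responsible for destroying it, which is exactly what gets awkward at static destruction time.

```c++
#include <cuda_runtime_api.h>

int main() {
  // Create a device-local memory pool that we own.
  cudaMemPoolProps props{};
  props.allocType     = cudaMemAllocationTypePinned;
  props.handleTypes   = cudaMemHandleTypeNone;
  props.location.type = cudaMemLocationTypeDevice;
  props.location.id   = 0;

  cudaMemPool_t pool{};
  cudaMemPoolCreate(&pool, &props);

  cudaStream_t stream{};
  cudaStreamCreate(&stream);

  // Allocate and free from our pool on a stream.
  void* p = nullptr;
  cudaMallocFromPoolAsync(&p, 1 << 20, pool, stream);
  cudaFreeAsync(p, stream);
  cudaStreamSynchronize(stream);

  cudaStreamDestroy(stream);
  cudaMemPoolDestroy(pool);  // owning the pool means we must destroy it
  return 0;
}
```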
My mistake, I didn't realize that we were allocating from a specific pool that we created. The failure mode should still be relatively graceful if two processes both use the async allocation routines and one pool blocks another's growth. I don't think it will be as graceful if you mix and match async with non-async allocation, but I could be wrong there.
I believe the reason that cuda_async_memory_resource owns its pool is because we provide a non-owning MR: cuda_async_view_memory_resource. We could use the latter as the default resource with the default pool. That MR requires passing a cudaMemPool_t pool handle, and the default pool handle can be retrieved using cudaDeviceGetDefaultMemPool.
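A minimal sketch of that variant, assuming we wrap the driver's default pool in the non-owning view resource:

```c++
#include <rmm/mr/device/cuda_async_view_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <cuda_runtime_api.h>

int main() {
  // Fetch the driver-managed default pool for device 0; RMM neither
  // creates nor destroys it, so there is no static destruction problem.
  cudaMemPool_t default_pool{};
  cudaDeviceGetDefaultMemPool(&default_pool, /*device=*/0);

  rmm::mr::cuda_async_view_memory_resource view_mr{default_pool};
  rmm::mr::set_current_device_resource(&view_mr);
  return 0;
}
```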
Perhaps, however, we should wait to make this default change until we can start using the cuda_async_memory_resource from libcu++, which has a different design.
RMM has multiple pool-like allocators:

- pool_memory_resource, which wraps a coalescing best-fit suballocator around an upstream resource;
- arena_memory_resource, which similarly wraps an upstream resource but divides the global allocation into size-binned arenas to mitigate fragmentation when allocating/deallocating;
- cuda_async_memory_resource, which uses the memory pool implementation provided by cudaMallocAsync. This one can avoid fragmentation because it is in control of the virtual address space.

Since these are all composable, one can happily wrap a pool_memory_resource around a cuda_async_memory_resource (or an arena, ...). But should one? (A sketch of such a composition follows below.) It would be useful if the documentation provided some guidance on which combinations make sense, and what typical allocation scenarios best fit a particular pool.
We should also recommend best practices for picking initial pool sizes: a bad choice here can lead to overfragmentation.
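For illustration only, here is what such a composition looks like: a pool_memory_resource suballocating out of a cuda_async_memory_resource, with an arbitrarily chosen 1 GiB initial size. Whether this layering ever beats using the async resource alone, and how to size it, is exactly the guidance being asked for:

```c++
#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

#include <cstddef>

int main() {
  // Upstream: cudaMallocAsync-backed resource with its own driver pool.
  rmm::mr::cuda_async_memory_resource async_mr;

  // Coalescing best-fit suballocator layered on top, pre-sized to 1 GiB.
  // A poor initial size here is one source of the fragmentation mentioned above.
  rmm::mr::pool_memory_resource<rmm::mr::cuda_async_memory_resource> pool_mr{
      &async_mr, std::size_t{1} << 30};

  rmm::mr::set_current_device_resource(&pool_mr);
  return 0;
}
```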