[FEA] Use RMM memory pool by default

shwina commented 1 year ago

We should move to using an RMM managed memory pool by default.

This was brought up before in https://github.com/rapidsai/cudf/issues/2676. In response to that issue, we implemented set_allocator, https://github.com/rapidsai/cudf/pull/2682, but we chose not to enable the RMM pool by default (likely because we didn't want to monopolize GPU memory away from other libraries).

Since then, CuPy, Numba (and soon PyTorch) all can be configured to use RMM, and therefore share the same memory pool as cuDF.

Proposal

Concretely, the proposal is that import cudf will:

Set RMM's default memory resource to a pool memory resource
Configure CuPy, Numba, (and PyTorch?) to all use RMM's default memory resource

What should the initial and maximum pool size be?

An RMM pool can be configured with an initial and maximum pool size. The pool grows according to an implementation-defined strategy (see here for the current strategy).

As we cannot assume that all (or any) GPU memory is available when cuDF is imported, the initial pool size should be 0 bytes.
The only reasonable maximum pool size I can think of is the maximum available GPU memory. If the pool cannot expand to this size because of allocations made outside of RMM, so be it: we will OOM.

What happens if `import cudf` appears in the middle of the program?

All this works well if import cudf appears at the beginning of the program, i.e., before any device memory is actually allocated by any library). However, if it appears after some device objects have already been allocated, it can lead to early out-of-memory errors. As an example, consider some code that uses both PyTorch and cuDF in the following way:

import torch

# this part of the code uses PyTorch

import cudf

# this part of the code uses cudf

Because PyTorch uses a caching allocator, a memory pool already exists by the time we import cuDF. Importing cuDF initializes a second pool that all libraries (including PyTorch) will use going forward. The first pool essentially becomes a kind of dead space: no new device objects are ever allocated within the pool, and no device memory is ever freed from it.

There's no perfect solution I can think of to this particular problem, but it's probably a good idea to call empty_cache() before resetting the PyTorch allocator to minimize the amount of wasted memory.

That's just one example of the kind of issues that can arise if import cudf appears later. I think it's fine to assume this will be less common than importing it at the beginning.

vyasr commented 1 year ago

Some thoughts:

cuda-python/pynvml have APIs for checking total device memory usage (e.g. cudaMemGetInfo). Should we inject some checks at import time to ensure that the entire GPU is available, or maybe print some initial stats on the max pool size based on available GPU memory?
Can we wrap the allocator in an adapter that checks (on OOM) whether memory has been allocated outside of the pool (perhaps just using a naive metric like total_gpu_mem - pool_size) and provides the user a more useful error?
I'm always wary of making implicit and far-reaching changes like this silently. Should we consider adding some notification on import cudf that indicates that such a change is being made? I also don't like libraries that print information (especially warnings) unconditionally on import, so I don't particularly like this suggestion, just raising the possibility. Alternatively it could be set up via logging of some sort. I think we can safely assume that no matter how we broadcast this some users will not read the docs/notification/warning/etc and be surprised at the behavior, but hopefully (and I think it's likely that) more people are happy about the pool usage improving default performance.
Similar to the PyTorch example where cudf is imported later, could this change potentially break scripts that are using rmm to allocate memory prior to the cudf import? I believe that the answer is no since we made the necessary changes to save an allocator used for any allocation such that set_allocator won't make accessing allocations using the prior allocator UB, just want to be sure.

shwina commented 1 year ago

cuda-python/pynvml have APIs for checking total device memory usage (e.g. cudaMemGetInfo). Should we inject some checks at import time to ensure that the entire GPU is available, or maybe print some initial stats on the max pool size based on available GPU memory?

I don't know what (if anything) this does for us, and perhaps my concerns about the max_pool_size were unfounded to begin with. I think the best option is simply not to set a maximum pool size and let the pool grow to whatever size it can. This is what happens today if you do cudf.set_allocator(pool=True). Setting the max_pool_size to the amount of memory available at import time may be a bad idea because additional memory might be freed up later.

Can we wrap the allocator in an adapter that checks (on OOM) whether memory has been allocated outside of the pool (perhaps just using a naive metric like total_gpu_mem - pool_size) and provides the user a more useful error?

We can. I'm not opposed to it, but what can the user do with that information? When the GPU is e.g., being shared by multiple processes, it may even be misleading to say that their code is allocating memory outside of RMM's knowledge.

Should we consider adding some notification on import cudf that indicates that such a change is being made?

Given that other libraries like CuPy and PyTorch also use a pool by default, I don't know if this is very controversial. I want to say that 99% of users won't ever know or care.

could this change potentially break scripts that are using rmm to allocate memory prior to the cudf import?

I don't believe so (especially if we choose an initial pool size of 0). You are right that we keep a reference to the previous memory resource around so there's no danger/UB.

vyasr commented 1 year ago

This question came up again today. Personally I tend to be against this sort of implicit behavior and prefer that it always be opt-in, but it seems like many feel that the benefit to users outweighs potential risks (but not all, CC @wence- and @pentschev). That said, increasing cudf import times is a concrete cost that I probably wouldn't be willing to stomach for pytorch, whereas for cupy and numba we are importing them already so the question is more of a philosophical one of whether or not we should have side effects to importing cudf.

wence- commented 1 year ago

In 23.04, importing RMM will no longer hook up cupy, numba, and pytorch to use the RMM allocators (that was https://github.com/rapidsai/rmm/pull/1221).

That was mostly done for import time reduction, but it also has the consequence that importing rmm doesn't modify the stateful allocation environment.

Within cudf, since we use cupy and numba allocated data in some places, it makes sense to configure those libraries (when imported) to use the same allocator that cudf uses. For now at least, pytorch is not imported by cudf, so it's not as critical that we configure things.

That said, if we do go with pool-based allocation by default, then the likelihood of early (not actually) OOM does go up. My feeling is that we should solve this with a "best practices for interop" guide rather than trying to pin down all the possible libraries that might interoperate on device memory with cudf.

In general I tend not to like libraries that configure themselves into too many third-party systems (it always feels more "framework" than "library").

VibhuJawa commented 1 year ago

There are two things here we are discussing here.

Using RMM Pool by default

This comes down to choosing perf over potential memory contention from other libraries .

In all my interactions with customers when they are just using cuDF without dask i have never seen them use rmm pool. I agree with the technical challenges of doing this but the actual impact of this is 99% of customers will never use POOL and we will have to be fine with it.

Configuring Pytorch to use rmm torch allocator.

Regarding 2, My fear is that most users wont know if this is configurable and I think there is also no clean way for us to inform user when they run into OOMs .

That said , we cant probably use the RMM Pool cleanly right now with Pytorch as that can lead to fragmentation. We will have to probably use the async memory allocator.

Example of what a user will have to configure currently is below:

    import cudf
    import cupy
    import rmm

    mr = rmm.mr.CudaAsyncMemoryResource()
    rmm.mr.set_current_device_resource(mr)
    torch.cuda.memory.change_current_allocator(rmm.rmm_torch_allocator)
    cupy.cuda.set_allocator(rmm.allocators.cupy.rmm_cupy_allocator)

jakirkham commented 1 year ago

On 2, there's at least a blogpost now. Maybe we can do more to promote that?

VibhuJawa commented 1 year ago

On 2, there's at least a blogpost now. Maybe we can do more to promote that?

I agree that blog post is what I link so that is very useful indeed.

I think having a dedicated place in the cuDF docs for it will be great as a lot of people who use cuDF would probably not know what RMM Pool is (similar to people using Pytorch not knowing about memory management ) so we will need a way to link it from cuDF docs too.

harrism commented 1 year ago

The first line of the PR description says:

using an RMM managed memory pool

Be careful. "managed memory pool" implies a pool_memory_resource with a managed_memory_resource (aka Unified Memory, cudaMallocManaged) as upstream. Are you saying that should be the default? Or do you mean a pool_memory_resource with cuda_memory_resource (cudaMalloc) as upstream?

GregoryKimball commented 5 months ago

There is a great discussion in this issue so far, and it (implicitly) has mostly referred to a default unmanaged (device-only) pool. The issue of a default memory resource is coming up again in the context of preventing OOM errors for the single workstation, beginner cuDF user, where a default managed memory resource could be a good solution.

I'm interested to learn more about the default pool behavior in cupy and pytorch as mentioned above. I'm also interested to learn more about how experienced users and other libraries could automatically be opted-out of a default managed resource, while still providing this feature automatically to the beginner user.

Perhaps we could have a hook on the first column allocation that checks if RMM has ever had reinitialize called, and if not, create a managed pool with initial size zero. Then when the pool grows we can add a hook to check the available system memory and throw an OOM exception well before completely locking up the system.

harrism commented 5 months ago

Another concern I have in this issue is the recommendation to start with a pool of 0 bytes. With the current pool implementation this will create a lot of fragmentation.

vyasr commented 5 months ago

rapidsai / cudf