
[Story] Enabling prefetching of unified memory #16251

Open vyasr opened 1 month ago

vyasr commented 1 month ago

Problem statement

cudf.pandas has substantially increased the number of users running cuDF on workloads that trigger out-of-memory (OOM) errors. This is particularly problematic because cuDF typically OOMs on datasets far smaller than the available GPU memory, due to the overhead of various algorithms, so users cannot process datasets that they might reasonably expect to fit. Addressing these OOM errors is one of the highest priorities for cuDF in order to support users with smaller GPUs, such as consumer cards with less memory.

Unified memory is one possible solution to this problem since algorithms are no longer bound by the memory physically available on the device, and RMM exposes a managed memory resource so users can easily switch over (see the sketch below). However, naive usage of unified memory introduces severe performance bottlenecks due to page faulting, so simply switching over is not an option for cuDF or libcudf. Before we can use unified memory in production, we need to implement mitigating strategies that avoid faults, using either hinting or prefetching to trigger migration ahead of time. Here we propose using systematic prefetching for this purpose.
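For context, opting into unified memory via RMM is already a one-liner; the hard part is the faulting behavior that follows. A minimal sketch, assuming the RMM C++ API of the 24.x releases:

```cpp
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main()
{
  // All subsequent allocations through the current device resource are backed
  // by cudaMallocManaged, so they can oversubscribe GPU memory, but they may
  // incur page faults on first touch without hints or prefetching.
  rmm::mr::managed_memory_resource mr;
  rmm::mr::set_current_device_resource(&mr);
  // ... run cuDF / libcudf work here ...
  return 0;
}
```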

Goals:

Non-Goals:

Short-term Proposal (mix of 24.08 and 24.10)

We need to make an expedient set of changes to enable prefetching when using managed memory. While cudf.pandas is the primary target in the immediate term, we cannot realistically achieve this purely in the Python layer and will need some libcudf work. With that in mind, I propose the following changes:

  1. Implement a new PrefetchMemoryResource that performs a prefetch when data is allocated (a sketch follows this list). This is important because injecting prefetches into cuIO is more challenging than in the rest of libcudf, so prefetching on allocation is a short-term fix that ensures buffers are prefetched before cuIO writes to them.
  2. Add a prefetch call to column_view/mutable_column_view::head.
  3. Subclass rmm::device_uvector to create cudf::device_uvector and add a prefetch call to cudf::device_uvector::data. All internal uses of rmm::device_uvector should be replaced with cudf::device_uvector, but functions returning rmm::device_uvector need not be changed.
  4. Add a global configuration option in libcudf to turn prefetching on or off. All three of the above prefetches (and any others added) should be gated behind this option.
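A minimal sketch of item 1, assuming the pre-24.10 RMM C++ interface; the class name and exact virtuals are illustrative, not the final implementation. The resource forwards to an upstream allocator and prefetches every new allocation to the current device, so managed buffers are resident before anything writes to them:

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cuda_runtime_api.h>

class prefetch_memory_resource final : public rmm::mr::device_memory_resource {
 public:
  explicit prefetch_memory_resource(rmm::mr::device_memory_resource* upstream)
    : upstream_{upstream}
  {
  }

 private:
  void* do_allocate(std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    void* ptr = upstream_->allocate(bytes, stream);
    int device{};
    cudaGetDevice(&device);
    // Migrate the pages to the GPU before first touch to avoid page faults.
    cudaMemPrefetchAsync(ptr, bytes, device, stream.value());
    return ptr;
  }

  void do_deallocate(void* ptr, std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    upstream_->deallocate(ptr, bytes, stream);
  }

  rmm::mr::device_memory_resource* upstream_;
};
```

Wrapped around rmm::mr::managed_memory_resource, this would provide prefetch-on-allocate without touching any call sites.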

Items 2-4 are implemented in #16020 (3 is partially implemented; a global find-and-replace is still needed). Item 1 has been prototyped as a callback memory resource in Python for testing but needs to be converted to a proper C++ implementation.

This plan involves a number of compromises, but it offers significant advantages that I think make it worthwhile to proceed in the short term.

The drawbacks of this plan:

Long-term plans

Here we lay out various potential long-term solutions to address the concerns above.

Adding new pointer accessors for prefetching

Instead of modifying the behavior of existing data accessors and gating it behind a configuration option, we could introduce new data accessors. For instance, we could add column_view::data_prefetch(rmm::cuda_stream_view).
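A hypothetical illustration of what such an accessor might do, written as a free function for brevity (this is not an existing libcudf API; the proposal would make it a member of column_view):

```cpp
#include <cudf/column/column_view.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cuda_runtime_api.h>

template <typename T>
T const* data_prefetch(cudf::column_view const& col, rmm::cuda_stream_view stream)
{
  T const* ptr = col.data<T>();
  int device{};
  cudaGetDevice(&device);
  // Only valid on managed allocations; a real implementation would need to
  // detect or tolerate non-managed memory.
  cudaMemPrefetchAsync(ptr, col.size() * sizeof(T), device, stream.value());
  return ptr;
}
```

Algorithm authors would then opt in explicitly at each call site rather than having prefetching happen implicitly inside column_view::head.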

Pros:

Cons:

Using a macro of some sort to trigger prefetching

Instead of adding new accessors, we could add a macro that could be inserted into algorithms to indicate a set of columns to prefetch. This approach has essentially the same pros and cons as the above, so it is really a question of which implementation we prefer if we go down either of these routes.
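One possible shape for such a macro, assuming a hypothetical cudf::prefetch_enabled() query for the global configuration option from item 4:

```cpp
#include <cuda_runtime_api.h>

// cudf::prefetch_enabled() is a placeholder for the global configuration
// query; it is not an existing API.
#define CUDF_PREFETCH(ptr, bytes, stream)                            \
  do {                                                               \
    if (cudf::prefetch_enabled()) {                                  \
      int dev_{};                                                    \
      cudaGetDevice(&dev_);                                          \
      cudaMemPrefetchAsync((ptr), (bytes), dev_, (stream).value());  \
    }                                                                \
  } while (0)

// At a call site inside an algorithm, before the kernel launch:
//   CUDF_PREFETCH(input.data(), input.size() * sizeof(T), stream);
```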

Adding prefetching to rmm data structures

Pros:

Cons:

Updating cuIO to properly handle prefetching

Updating cuIO data structures to properly handle prefetching is a long-term requirement.

Pros:

Cons:

davidwendt commented 1 month ago

I'd like us to consider an alternate libcudf implementation that is more work but may be better in terms of control and maintenance going forward. I believe we could build a set of utilities that accept pointers or a variety of container types and perform the prefetch, then insert calls to these utilities before each kernel launch. This gives the algorithm author the best control over when and what is prefetched, with no surprises or side effects.
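For concreteness, a sketch of what such utilities might look like; the names and namespace are hypothetical, not existing libcudf functions:

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>

namespace cudf::detail {

// Prefetch a raw span of managed memory to the current device.
inline void prefetch(void const* ptr, std::size_t bytes, rmm::cuda_stream_view stream)
{
  int device{};
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(ptr, bytes, device, stream.value());
}

// Overload for a container; further overloads could accept column_view, etc.
template <typename T>
void prefetch(rmm::device_uvector<T> const& v, rmm::cuda_stream_view stream)
{
  prefetch(v.data(), v.size() * sizeof(T), stream);
}

}  // namespace cudf::detail

// At a call site, the algorithm author prefetches exactly what the kernel
// will touch, immediately before launching it:
//   cudf::detail::prefetch(input, stream);
//   my_kernel<<<grid, block, 0, stream.value()>>>(input.data(), ...);
```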

I'd like to keep logic like this out of the containers (column_view and device_uvector). I feel it introduces hidden side effects that would be difficult to avoid, similar to the lazy null-count logic that was removed several releases ago. I know this is more work, but I think having the logic inline with the kernel launches will be easier to maintain and control. We can easily decide which algorithms need prefetching (and when, how, and which parts) and iteratively work on specific chunking solutions in the future without affecting all the other APIs.

vyasr commented 1 month ago

I concur with your assessment for the long term, but as detailed in the issue I don't think it is feasible on the timeline we are targeting. Inserting changes before every kernel launch, even fairly trivial ones, seems like a task that will take at least one full release, since the initial work will require achieving consensus on what those changes should be.

Is there something I wrote in the issue that you disagree with? I tried to address pretty much this exact concern in the issue since I share it and anticipated that others would raise it at this point.

davidwendt commented 1 month ago

> Is there something I wrote in the issue that you disagree with? I tried to address pretty much this exact concern in the issue since I share it and anticipated that others would raise it at this point.

I only disagree with modifying column_view and subclassing device_uvector, even in the short term. The first makes me uneasy because of its global nature: it likely will not hit all the desired code paths and may cause unnecessary prefetching in other cases (leading to more workarounds, etc.). The subclassed device_uvector requires a codebase-wide change on a similar scale to what I was proposing, so it does not save us that much work.

I was hoping that we could add prefetch to a few APIs quickly using a targeted approach with a handful of utilities in the short term, and then roll out the rest over the long term.

vyasr commented 1 month ago

> I was hoping that we could add prefetch to a few APIs quickly using a targeted approach with a handful of utilities in the short term, and then roll out the rest over the long term.

The problem I see with that approach is that, while we might get good results on a particular set of benchmarks, we will not be able to make a managed memory resource the default without substantially slowing down a wide range of APIs (anything that doesn't have prefetching enabled). At a minimum we should run the cudf microbenchmarks with a managed memory resource. I suspect the results will not support defaulting cudf.pandas to a managed memory resource without the more blanket approach to prefetching, unless we choose to wait for the longer-term solution in which we roll out your proposed changes to more APIs.

vyasr commented 1 month ago

Copying from Slack:

We came to the following compromise during the discussion:

  • We will merge the column_view/mutable_column_view changes from #16020 to allow prefetching to occur on the widest set of APIs possible in the short term.
  • We will not merge the device_uvector changes, because they require touching many places. Instead, we will find everywhere such changes would be needed and insert manual prefetch calls there, as in #16265. Since that is the long-term solution we prefer anyway, we should do that rather than changing device_uvector; it requires changes in the same number of places. My hope is that in the short term these will all be prefetches on device_uvectors or device_buffers, the places where we know the column_view solution has no effect.
  • We will include the prefetch allocator.
  • We will keep the configuration options in place.
  • Over the course of the next couple of months, we will run libcudf benchmarks and cudf Python microbenchmarks using managed memory and identify hot spots that need manual prefetching added. As we do this, we will turn off the column_view prefetching to ensure we are capturing all of the same needs. Once we are satisfied, we will remove prefetching from column_view.

I'm going to work on updating #16020 today to remove the undesirable changes, and then David and I will aim to get his changes merged tomorrow.