
[Story] Enabling prefetching of unified memory #16251

Open vyasr opened 1 month ago

vyasr commented 1 month ago

Problem statement

cudf.pandas has substantially increased the number of users running cuDF on workloads that trigger out-of-memory (OOM) errors. This is particularly problematic because cuDF typically OOMs on datasets far smaller than the available GPU memory, due to the overhead of various algorithms, so users cannot process datasets that they might reasonably expect to fit. Addressing these OOM errors is one of the highest priorities for cuDF in order to support users with smaller GPUs, such as consumer cards with less memory.

Unified memory is one possible solution to this problem since algorithms are no longer bound by the memory physically available on the device, and RMM exposes a managed memory resource so users can easily switch over (see the sketch below). However, naive usage of unified memory introduces severe performance bottlenecks due to page faulting, so simply switching over is not an option for cuDF or libcudf. Before we can use unified memory in production, we need to implement mitigating strategies that avoid faults, using either hinting or prefetching to trigger migration ahead of time. Here we propose using systematic prefetching for this purpose.
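For context, opting into unified memory via RMM is already a one-liner; the hard part is the faulting behavior that follows. A minimal sketch, assuming the RMM C++ API of the 24.x releases:

```cpp
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main()
{
  // All subsequent allocations through the current device resource are backed
  // by cudaMallocManaged, so they can oversubscribe GPU memory, but they may
  // incur page faults on first touch without hints or prefetching.
  rmm::mr::managed_memory_resource mr;
  rmm::mr::set_current_device_resource(&mr);
  // ... run cuDF / libcudf work here ...
  return 0;
}
```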

Goals:

Non-Goals:

Short-term Proposal (mix of 24.08 and 24.10)

We need to make an expedient set of changes to enable prefetching when using managed memory. While cudf.pandas is the primary target in the immediate term, we cannot realistically achieve this purely in the Python layer and will need some libcudf work. With that in mind, I propose the following changes:

  1. Implement a new PrefetchMemoryResource that performs a prefetch when data is allocated (a sketch follows this list). This is important because injecting prefetches into cuIO is more challenging than in the rest of libcudf, so prefetching on allocation is a short-term fix that ensures buffers are prefetched before cuIO writes to them.
  2. Add a prefetch call to column_view/mutable_column_view::head.
  3. Subclass rmm::device_uvector to create cudf::device_uvector and add a prefetch call to cudf::device_uvector::data. All internal uses of rmm::device_uvector should be replaced with cudf::device_uvector, but functions returning rmm::device_uvector need not be changed.
  4. Add a global configuration option in libcudf to turn prefetching on or off. All three of the above prefetches (and any others added) should be gated behind this option.
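A minimal sketch of item 1, assuming the pre-24.10 RMM C++ interface; the class name and exact virtuals are illustrative, not the final implementation. The resource forwards to an upstream allocator and prefetches every new allocation to the current device, so managed buffers are resident before anything writes to them:

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cuda_runtime_api.h>

class prefetch_memory_resource final : public rmm::mr::device_memory_resource {
 public:
  explicit prefetch_memory_resource(rmm::mr::device_memory_resource* upstream)
    : upstream_{upstream}
  {
  }

 private:
  void* do_allocate(std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    void* ptr = upstream_->allocate(bytes, stream);
    int device{};
    cudaGetDevice(&device);
    // Migrate the pages to the GPU before first touch to avoid page faults.
    cudaMemPrefetchAsync(ptr, bytes, device, stream.value());
    return ptr;
  }

  void do_deallocate(void* ptr, std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    upstream_->deallocate(ptr, bytes, stream);
  }

  rmm::mr::device_memory_resource* upstream_;
};
```

Wrapped around rmm::mr::managed_memory_resource, this would provide prefetch-on-allocate without touching any call sites.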

Items 2-4 are implemented in #16020 (3 is partially implemented; a global find-and-replace is still needed). Item 1 has been prototyped as a callback memory resource in Python for testing but needs to be converted to a proper C++ implementation.

This plan involves a number of compromises, but it offers significant advantages that I think make it worthwhile to proceed in the short term.

The drawbacks of this plan:

Long-term plans

Here we lay out various potential long-term solutions to address the concerns above.

Adding new pointer accessors for prefetching

Instead of modifying the behavior of existing data accessors and gating it behind a configuration option, we could introduce new data accessors. For instance, we could add column_view::data_prefetch(rmm::cuda_stream_view).
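A hypothetical illustration of what such an accessor might do, written as a free function for brevity (this is not an existing libcudf API; the proposal would make it a member of column_view):

```cpp
#include <cudf/column/column_view.hpp>
#include <rmm/cuda_stream_view.hpp>

#include <cuda_runtime_api.h>

template <typename T>
T const* data_prefetch(cudf::column_view const& col, rmm::cuda_stream_view stream)
{
  T const* ptr = col.data<T>();
  int device{};
  cudaGetDevice(&device);
  // Only valid on managed allocations; a real implementation would need to
  // detect or tolerate non-managed memory.
  cudaMemPrefetchAsync(ptr, col.size() * sizeof(T), device, stream.value());
  return ptr;
}
```

Algorithm authors would then opt in explicitly at each call site rather than having prefetching happen implicitly inside column_view::head.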

Pros:

Cons:

Using a macro of some sort to trigger prefetching

Instead of adding new accessors, we could add a macro that could be inserted into algorithms to indicate a set of columns to prefetch. This approach has essentially the same pros and cons as the above, so it is really a question of which implementation we prefer if we go down either of these routes.
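One possible shape for such a macro, assuming a hypothetical cudf::prefetch_enabled() query for the global configuration option from item 4:

```cpp
#include <cuda_runtime_api.h>

// cudf::prefetch_enabled() is a placeholder for the global configuration
// query; it is not an existing API.
#define CUDF_PREFETCH(ptr, bytes, stream)                            \
  do {                                                               \
    if (cudf::prefetch_enabled()) {                                  \
      int dev_{};                                                    \
      cudaGetDevice(&dev_);                                          \
      cudaMemPrefetchAsync((ptr), (bytes), dev_, (stream).value());  \
    }                                                                \
  } while (0)

// At a call site inside an algorithm, before the kernel launch:
//   CUDF_PREFETCH(input.data(), input.size() * sizeof(T), stream);
```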

Adding prefetching to rmm data structures

Pros:

Cons:

Updating cuIO to properly handle prefetching

Updating cuIO data structures to properly handle prefetching is a long-term requirement.

Pros:

Cons:

davidwendt commented 1 month ago

I'd like us to consider an alternate libcudf implementation that is more work but may be better in terms of control and maintenance going forward. I believe we could build a set of utilities that accept pointers or a variety of container types and perform the prefetch, then insert calls to these utilities before each kernel launch. This gives the algorithm author the best control over when and what is prefetched, with no surprises or side effects.
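For concreteness, a sketch of what such utilities might look like; the names and namespace are hypothetical, not existing libcudf functions:

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>

namespace cudf::detail {

// Prefetch a raw span of managed memory to the current device.
inline void prefetch(void const* ptr, std::size_t bytes, rmm::cuda_stream_view stream)
{
  int device{};
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(ptr, bytes, device, stream.value());
}

// Overload for a container; further overloads could accept column_view, etc.
template <typename T>
void prefetch(rmm::device_uvector<T> const& v, rmm::cuda_stream_view stream)
{
  prefetch(v.data(), v.size() * sizeof(T), stream);
}

}  // namespace cudf::detail

// At a call site, the algorithm author prefetches exactly what the kernel
// will touch, immediately before launching it:
//   cudf::detail::prefetch(input, stream);
//   my_kernel<<<grid, block, 0, stream.value()>>>(input.data(), ...);
```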

I'd like to keep logic like this out of the containers (column_view and device_uvector). I feel it introduces hidden side effects that would be difficult to avoid, similar to the lazy null-count logic that was removed several releases ago. I know this is more work, but I think having the logic inline with the kernel launches will be easier to maintain and control. We can easily decide which algorithms need prefetching (and when, how, and which parts) and iteratively work on specific chunking solutions in the future without affecting all the other APIs.

vyasr commented 1 month ago

I concur with your assessment for the long term, but as detailed in the issue I don't think it is feasible on the timeline we are targeting. Inserting changes before every kernel launch, even fairly trivial ones, seems like a task that will take at least one full release, since the initial work will require achieving consensus on what those changes should be.

Is there something I wrote in the issue that you disagree with? I tried to address pretty much this exact concern in the issue since I share it and anticipated that others would raise it at this point.

davidwendt commented 1 month ago

> Is there something I wrote in the issue that you disagree with? I tried to address pretty much this exact concern in the issue since I share it and anticipated that others would raise it at this point.

I only disagree with modifying column_view and subclassing device_uvector, even in the short term. The first makes me uneasy because of its global nature: it likely will not hit all the desired code paths and may cause unnecessary prefetching in other cases (leading to more workarounds, etc.). The subclassed device_uvector requires a codebase-wide change on a similar scale to what I was proposing, so it does not save us that much work.

I was hoping that we could add prefetch to a few APIs quickly using a targeted approach with a handful of utilities in the short term, and then roll out the rest over the long term.

vyasr commented 1 month ago

> I was hoping that we could add prefetch to a few APIs quickly using a targeted approach with a handful of utilities in the short term, and then roll out the rest over the long term.

The problem I see with that approach is that, while we might get good results on a particular set of benchmarks, we will not be able to make a managed memory resource the default without substantially slowing down a wide range of APIs (anything that doesn't have prefetching enabled). At a minimum we should run the cudf microbenchmarks with a managed memory resource. I suspect the results will not support defaulting cudf.pandas to a managed memory resource without the more blanket approach to prefetching, unless we choose to wait for the longer-term solution in which we roll out your proposed changes to more APIs.

vyasr commented 1 month ago

Copying from Slack:

We came to the following compromise during the discussion:

  • We will merge the column_view/mutable_column_view changes from #16020 to allow prefetching to occur on the widest set of APIs possible in the short term.
  • We will not merge the device_uvector changes, because they require touching many places. Instead, we will find everywhere such changes would be needed and insert manual prefetch calls there, as in #16265. Since that is the long-term solution we prefer anyway, we should do that rather than changing device_uvector; it requires changes in the same number of places. My hope is that in the short term these will all be prefetches on device_uvectors or device_buffers, the places where we know the column_view solution has no effect.
  • We will include the prefetch allocator.
  • We will keep the configuration options in place.
  • Over the course of the next couple of months, we will run libcudf benchmarks and cudf Python microbenchmarks using managed memory and identify hot spots that need manual prefetching added. As we do this, we will turn off the column_view prefetching to ensure we are capturing all of the same needs. Once we are satisfied, we will remove prefetching from column_view.

I'm going to work on updating #16020 today to remove the undesirable changes, and then David and I will aim to get his changes merged tomorrow.