GregoryKimball opened this issue 11 months ago
The other thing to think about is what happens if you are not using UVM. Putting hints in is nice, but will they slow down processing when UVM is not being used? If they do, what is the best way to mitigate that?
@revans2 I do not expect that this would affect non-managed allocations. My expectation is that cudf will need to determine (or track) if an allocation is managed before attempting prefetching or giving other advice to the driver. That should be very inexpensive to check/track. Non-UVM cases shouldn’t see regressions as a result.
It seems like the right place to offer allocation hints is the memory resource, in which case non-managed memory resources could provide no-op implementations.
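One way to keep the managed-or-not check inexpensive is to ask the driver directly via `cudaPointerGetAttributes` and skip hinting for non-managed pointers. A minimal sketch; the helper name `maybe_prefetch` is hypothetical, not an existing or proposed libcudf API (GPU code, shown for illustration):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: prefetch only if the allocation is managed, so that
// non-UVM allocations take a cheap early-out and see no behavior change.
cudaError_t maybe_prefetch(void const* ptr, std::size_t size, cudaStream_t stream)
{
  cudaPointerAttributes attrs{};
  cudaError_t err = cudaPointerGetAttributes(&attrs, ptr);
  if (err != cudaSuccess) { return err; }

  // Only managed allocations can be prefetched or given memory advice.
  if (attrs.type != cudaMemoryTypeManaged) { return cudaSuccess; }  // no-op path

  int device{};
  cudaGetDevice(&device);
  return cudaMemPrefetchAsync(ptr, size, device, stream);
}
```

Tracking the "is managed" bit at allocation time inside the memory resource, rather than querying the driver per call, is the alternative mentioned above and avoids even this query.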
Is your feature request related to a problem? Please describe. In cuDF-python and RMM, it's easy to opt into managed memory (also known as Unified Memory, UM, or Unified Virtual Memory, UVM). However, libcudf is not optimized for use with managed memory and encounters many "just too late" page faults when the oversubscription factor is greater than 1.
Please note: Using managed memory in libcudf is in the early stages of scoping. This issue will be refined over time. Topics to cover include:

- Hinting options and strategies
- Implementation ideas for libcudf
- Useful references for cudaMemAdvise
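For concreteness, these are the `cudaMemAdvise` hints most relevant to the oversubscription scenario. A sketch only; which hints pay off (and for which buffers) is exactly what the proposed experiments would need to measure:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Assumes ptr/size refer to a managed (cudaMallocManaged) allocation.
void advise_example(void* ptr, std::size_t size, int device)
{
  // Prefer keeping these pages resident on the given GPU.
  cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation, device);
  // Establish mappings in this GPU's page tables to reduce first-touch faults.
  cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, device);
  // For read-mostly data, allow read-duplicated copies across processors.
  cudaMemAdvise(ptr, size, cudaMemAdviseSetReadMostly, device);
}
```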
Describe the solution you'd like I would like to add a libcudf benchmark for studying managed memory performance, and then run some targeted experiments (with profiling) to observe the impact of different hinting strategies. Once we have identified a promising design, we will open a more targeted issue.
Describe alternatives you've considered Continue to let Dask and Spark-RAPIDS catch and retry when there are device OOM errors.
Additional context Please note that with managed memory pools, the pool allocation is lazy. This differs from unmanaged memory pools, where we allocate the full pool upfront, trading slightly longer startup time for much faster allocations inside algorithms.
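The lazy-pool setup referred to above can be sketched with RMM's existing resources: a pool suballocator layered over a managed upstream. The initial pool size here is an illustrative choice, not a recommendation (GPU code, shown for illustration):

```cpp
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main()
{
  // Upstream managed (UVM) resource; pages materialize on first touch,
  // so the pool's reservation does not pin physical device memory upfront.
  rmm::mr::managed_memory_resource upstream;

  rmm::mr::pool_memory_resource<rmm::mr::managed_memory_resource> pool{
    &upstream, 4ull << 30 /* 4 GiB initial size, illustrative */};

  void* p = pool.allocate(1 << 20);  // suballocated from the managed pool
  pool.deallocate(p, 1 << 20);
  return 0;
}
```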
Useful blog posts:

- https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
- https://developer.nvidia.com/blog/improving-gpu-memory-oversubscription-performance/
- https://developer.nvidia.com/blog/maximizing-unified-memory-performance-in-cuda/
- https://developer.nvidia.com/blog/beyond-gpu-memory-limits-unified-memory-pascal/