rapidsai / rmm

RAPIDS Memory Manager
https://docs.rapids.ai/api/rmm/stable/
Apache License 2.0

Add rmm::prefetch() and DeviceBuffer.prefetch() #1573

Closed: harrism closed this 4 weeks ago

harrism commented 1 month ago

Description

This adds two rmm::prefetch() functions in C++.

  1. rmm::prefetch(void *ptr, size_t bytes, device, stream)
  2. rmm::prefetch<T>(cuda::std::span<T> data, device, stream)

Item 2 enables prefetching the containers that RMM provides (device_uvector, device_scalar), which support conversion to cuda::std::span. To enable that conversion, this PR adds device_scalar::size().

Note that device_buffer must be prefetched using item 1, because a span<void> cannot be created.
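
For illustration, here is a hypothetical usage sketch of the two overloads (assuming <rmm/prefetch.hpp> and the rmm::cuda_device_id / rmm::cuda_stream_view parameter types used elsewhere in RMM):

rmm::cuda_stream stream;
auto device = rmm::get_current_cuda_device();

// Overload 1: raw pointer + size. This is the only option for device_buffer.
rmm::device_buffer buf{256, stream};
rmm::prefetch(buf.data(), buf.size(), device, stream);

// Overload 2: anything convertible to cuda::std::span, e.g. device_uvector.
rmm::device_uvector<int> vec(100, stream);
rmm::prefetch<int>(vec, device, stream);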

In Python, this adds DeviceBuffer.prefetch(), because DeviceBuffer is really the only RMM Python data type to prefetch. There is one Cython use of device_uvector in the cuDF join code that we may need to add prefetch support for later.

prefetch is a no-op on non-managed memory. Rather than querying the memory type, it simply catches the cudaErrorInvalidValue that cudaMemPrefetchAsync returns for such memory.
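
A minimal sketch of that pattern (illustrative only, not the exact RMM implementation):

#include <cuda_runtime_api.h>
#include <cstddef>

void prefetch_sketch(void const* ptr, std::size_t bytes, int device, cudaStream_t stream)
{
  cudaError_t result = cudaMemPrefetchAsync(ptr, bytes, device, stream);
  if (result == cudaErrorInvalidValue) {
    cudaGetLastError();  // clear the sticky error; non-managed memory is a no-op
    return;
  }
  // Any other error is a real failure and should be surfaced
  // (RMM would use something like RMM_CUDA_TRY here).
}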


jrhemstad commented 1 month ago

I don't think this should be a member function. The fact that the member function is only useful for specific upstream resource types is a big code smell to me.

Instead, I think this should be a freestanding function like:

template <typename T>
auto prefetch(T const* ptr, size_t n, device, stream);

harrism commented 1 month ago

I don't think this should be a member function. The fact that the member function is only useful for specific upstream resource types is a big code smell to me. Instead, I think this should be a freestanding function

I actually had the same thought over the weekend. But there are a few other factors to consider.

  1. In RMM Python, the only thing one has to prefetch is a DeviceBuffer, and pointers are non-Pythonic. So I think it makes sense to associate prefetch with DeviceBuffer in the Python API, either as a method or a free function that expects a DeviceBuffer.
  2. I do think we want to put unconditional calls to prefetch in algorithm code (e.g. in cuDF). These should be functional no-ops when the memory is not managed. (A migrateable or similar property for cuda::mr would help with that...)
  3. What do you think about versions for containers and iterators? I think only providing a pointer interface is error-prone, and makes it harder than necessary to prefetch a range from a container, or a whole container.
template <typename Iterator>
auto prefetch(Iterator begin, Iterator end, device, stream)
{
  // typename is required here: value_type is a dependent type
  using value_type = typename std::iterator_traits<Iterator>::value_type;
  // assumes a contiguous iterator, so &*begin yields the underlying pointer
  return prefetch(&*begin, std::distance(begin, end) * sizeof(value_type), device, stream);
}

template <typename Container>
auto prefetch(Container const& c, device, stream) 
{ 
  return prefetch(c.begin(), c.end(), device, stream);
}
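
Hypothetical call sites for the overloads sketched above (assuming a contiguous container such as device_uvector, whose iterators are raw pointers):

rmm::device_uvector<int> vec(100, stream);
prefetch(vec, device, stream);                            // whole container
prefetch(vec.begin(), vec.begin() + 10, device, stream);  // sub-range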

harrism commented 1 month ago

I think with HMM/ATS, system allocated memory (malloc/free) can also be prefetched, but we can cross that bridge when we come to it.

@rongou I crossed the bridge: made the docs a bit less specific to managed memory based on your suggestion, added a mention of / link to HMM. I couldn't find a good link for ATS though.

harrism commented 1 month ago

Apologies, I had intended to review this but haven't quite made the time for it yet and things have moved along quite quickly. I'll attempt to review tomorrow, but don't let my missing review stop you from merging if you think this is ready.

No worries, we still need a CMake codeowners approval anyway, and I would like @jrhemstad to approve since he complained about the device_buffer method approach I originally took. Also @miscco for the span stuff.

wence- commented 4 weeks ago

Devcontainer build failures are (I think) due to not having https://github.com/NVIDIA/cccl/pull/1836

harrism commented 4 weeks ago

Devcontainer build failures are (I think) due to not having NVIDIA/cccl#1836

I don't understand. This was working previously. I think this broke with the update to CCCL 2.5; is this a regression in 2.5?

@miscco any idea why this was working fine before? I tried to go back to your godbolt example, but godbolt only supports up to CCCL 2.2.

Note that this only affects device_scalar; device_uvector converts to a span with no problem. I confirmed that adding begin/end to device_scalar fixes this, but I'm not sure we want to do that...
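
For context, a hypothetical sketch of what begin()/end() on device_scalar would look like (not something this PR adds; device_scalar holds exactly one element):

template <typename T>
struct device_scalar_like {  // stand-in for rmm::device_scalar<T>
  T* data();
  [[nodiscard]] std::size_t size() const noexcept { return 1; }
  T* begin() { return data(); }         // points at the single element
  T* end() { return data() + size(); }  // one past the end
};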

leofang commented 4 weeks ago

It seems the doc-build pipeline failed but there's no log. Is it possible to retrigger it?

harrism commented 4 weeks ago

/merge