I don't think this should be a member function. The fact that the member function is only useful for specific upstream resource types is a big code smell to me.
Instead, I think this should be a freestanding function like:
template <typename T>
auto prefetch(T const* ptr, size_t n, device, stream);
I actually had the same thought over the weekend. But there are a few other factors to consider.
In Python, the primary data type to prefetch is DeviceBuffer. And pointers are non-Pythonic. So I think it makes sense to associate prefetch with DeviceBuffer in the python API, either as a method or a free function that expects a DeviceBuffer.
We also want to call prefetch in algorithm code (e.g. in cuDF). These should be functional no-ops when the memory is not managed. (A migrateable or similar property for cuda::mr would help with that...)
On the C++ side, overloads like the following could cover iterators and containers:
template <typename Iterator>
auto prefetch(Iterator begin, Iterator end, device, stream)
{
constexpr auto elt_size = sizeof(typename std::iterator_traits<Iterator>::value_type);
return prefetch(begin, std::distance(begin, end) * elt_size, device, stream);
}
template <typename Container>
auto prefetch(Container const& c, device, stream)
{
return prefetch(c.begin(), c.end(), device, stream);
}
I think with HMM/ATS, system allocated memory (malloc/free) can also be prefetched, but we can cross that bridge when we come to it.
@rongou I crossed the bridge: made the docs a bit less specific to managed memory based on your suggestion, added a mention of / link to HMM. I couldn't find a good link for ATS though.
Apologies, I had intended to review this but haven't quite made the time for it yet and things have moved along quite quickly. I'll attempt to review tomorrow, but don't let my missing review stop you from merging if you think this is ready.
No worries, still need a cmake codeowners approval anyway, and I would like @jrhemstad to approve since he complained about the device_buffer method approach I originally took. Also @miscco for the span stuff.
Devcontainer build fails are due to (I think) not having https://github.com/NVIDIA/cccl/pull/1836
I don't understand. This was working previously. I think this broke with the update to CCCL 2.5. Is this a regression in 2.5?
@miscco any idea why this was working fine before? I tried to go back to your godbolt example, but godbolt only supports up to CCCL 2.2.
Note that this only affects device_scalar; device_uvector converts to a span with no problem. I confirmed that adding begin()/end() to device_scalar fixes this, but I'm not sure we want to do that...
It seems the doc-build pipeline failed but there's no log. Is it possible to retrigger it?
/merge
Description
This adds two rmm::prefetch() functions in C++:
1. rmm::prefetch(void *ptr, size_t bytes, device, stream)
2. rmm::prefetch<T>(cuda::std::span<T> data, device, stream)
Item 2 enables prefetching the containers that RMM provides (device_uvector, device_scalar) that support conversion to cuda::std::span. In order to enable that, device_scalar::size() is added.
Note that device_buffers must be prefetched using item 1 because you can't create a span<void>.
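For example, usage might look roughly like this (a sketch rather than code from this PR; the device and stream arguments are assumed to be an rmm::cuda_device_id and an rmm::cuda_stream_view, and the current device resource is assumed to be a managed memory resource):
// Sketch: item 2 (span overload) for a device_uvector, item 1 (pointer overload) for a device_buffer
void prefetch_example(rmm::cuda_device_id device, rmm::cuda_stream_view stream)
{
  rmm::device_uvector<float> vec(1024, stream);              // convertible to cuda::std::span
  rmm::prefetch<float>(vec, device, stream);                  // item 2
  rmm::device_buffer buf(4096, stream);                       // no span<void>, so pass pointer and size
  rmm::prefetch(buf.data(), buf.size(), device, stream);      // item 1
}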
In Python, this adds DeviceBuffer.prefetch() because that's really the only RMM Python data type to prefetch. There is one Cython use of device_uvector in cuDF join that we might need to add prefetch support for later.
prefetch is a no-op on non-managed memory. Rather than querying the type of memory, it just catches cudaErrorInvalidValue from cudaMemPrefetchAsync.
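A rough sketch of that behavior (not the exact code from this PR; the parameter types and the RMM_CUDA_TRY error-checking macro are assumptions):
// Sketch: cudaErrorInvalidValue (returned for non-managed memory) is swallowed, making prefetch a no-op
void prefetch(void const* ptr, std::size_t bytes, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
{
  auto const result = cudaMemPrefetchAsync(ptr, bytes, device.value(), stream.value());
  if (result == cudaErrorInvalidValue) { return; }  // not managed memory: silently do nothing
  RMM_CUDA_TRY(result);                             // any other error is still reported
}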
Checklist