jakirkham opened this issue 4 years ago
As to `pack`, maybe these CUDA Virtual Memory Management APIs (pointed out by Cory in a related context) would be useful?
This functionality is kind of outside the scope of RMM. The direction we'd like to go with RMM is really to just be as simple as possible by providing a set of resources, the tools to get/set the default resource, and containers like `device_buffer`/`device_vector` that work with resources. This kind of pack or concatenate functionality is really more in the wheelhouse of a consumer of RMM like cuDF.
In cuDF, this is just a `concatenate`: https://github.com/rapidsai/cudf/pull/4224
That imposes a lot of challenges: you have to construct your data in a way that allows using `concatenate` and then still be able to unpack it cleanly.
Maybe it's a new API that could live in cuDF, but really we're looking for some API that takes a `vector<device_buffer>` and returns a `device_buffer` along with some form of metadata to then be able to "unpack" it back into a `vector<device_buffer>`.
This is along similar lines to https://github.com/rapidsai/cudf/issues/3793, but is more generalized than `cudf::table_view`, as this could be reused in a lot of non-cuDF places.
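Concretely, the requested API might look something like the sketch below; the names, signatures, and sizes-as-metadata are illustrative assumptions rather than a settled design:

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical signatures: the "metadata" here is simply each segment's size
// in bytes, which is enough to split the packed buffer back apart later.
std::pair<rmm::device_buffer, std::vector<std::size_t>> pack(
    std::vector<rmm::device_buffer> const& buffers, rmm::cuda_stream_view stream);

std::vector<rmm::device_buffer> unpack(rmm::device_buffer const& packed,
                                       std::vector<std::size_t> const& sizes,
                                       rmm::cuda_stream_view stream);
```

Keeping the metadata as a plain `vector<size_t>` makes the round trip trivial: `unpack` only needs the packed buffer and the segment sizes to rebuild the original pieces.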
Seems like this issue should be moved to libcudf.
I think we had this in RMM as opposed to libcudf because we wanted a place more general purpose than libcudf. For example, Dask/Distributed would possibly be interested in using this for packing/unpacking buffers in communication, but cuDF is way too bulky of a dependency for them.
Sure, but RMM isn't a catch-all for stuff we don't want to put into libcudf. It muddies the single responsibility principle to start putting random kernels into an allocator and container library (which currently has no kernels).
Yup, agreed. This felt like it was in a gray area, somewhat related to memory management, so the issue was raised here. Happy to defer it to somewhere else, but cuDF is too large of a dependency, unfortunately.
I think `cupy.concatenate` would also work here. Not sure if that is too large of a dependency for this use case (or if we are looking for a C++ operation too).
https://github.com/NVIDIA/cub/pull/359 (which adds `cub::DeviceMemcpy::Batched`) would be the right way to do this now.
FYI, the PR ( https://github.com/NVIDIA/cub/pull/359 ) landed. Looks like this will be part of CUB 2.1.0.
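For reference, here is a minimal sketch of how that batched copy could be used to pack many buffers into one allocation with a single copy kernel. It assumes CUB 2.1.0+; `pack_batched` is a hypothetical helper name and error checking is omitted:

```cpp
#include <cub/device/device_memcpy.cuh>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/device_uvector.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <cstdint>
#include <vector>

// Pack `buffers` into one allocation using cub::DeviceMemcpy::Batched.
rmm::device_buffer pack_batched(std::vector<rmm::device_buffer> const& buffers,
                                rmm::cuda_stream_view stream)
{
  auto const n = buffers.size();

  // Gather source pointers and sizes, and compute destination offsets.
  std::vector<void*> h_srcs(n);
  std::vector<void*> h_dsts(n);
  std::vector<std::size_t> h_sizes(n);
  std::size_t total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    h_srcs[i]  = const_cast<void*>(buffers[i].data());
    h_sizes[i] = buffers[i].size();
    total += h_sizes[i];
  }

  rmm::device_buffer packed{total, stream};
  std::size_t offset = 0;
  for (std::size_t i = 0; i < n; ++i) {
    h_dsts[i] = static_cast<char*>(packed.data()) + offset;
    offset += h_sizes[i];
  }

  // The pointer/size arrays must be readable from the device.
  rmm::device_uvector<void*> d_srcs(n, stream);
  rmm::device_uvector<void*> d_dsts(n, stream);
  rmm::device_uvector<std::size_t> d_sizes(n, stream);
  cudaMemcpyAsync(d_srcs.data(), h_srcs.data(), n * sizeof(void*),
                  cudaMemcpyHostToDevice, stream.value());
  cudaMemcpyAsync(d_dsts.data(), h_dsts.data(), n * sizeof(void*),
                  cudaMemcpyHostToDevice, stream.value());
  cudaMemcpyAsync(d_sizes.data(), h_sizes.data(), n * sizeof(std::size_t),
                  cudaMemcpyHostToDevice, stream.value());

  // Standard two-phase CUB call: size the temporary storage, then run.
  std::size_t temp_bytes = 0;
  cub::DeviceMemcpy::Batched(nullptr, temp_bytes, d_srcs.data(), d_dsts.data(),
                             d_sizes.data(), static_cast<std::uint32_t>(n),
                             stream.value());
  rmm::device_buffer temp{temp_bytes, stream};
  cub::DeviceMemcpy::Batched(temp.data(), temp_bytes, d_srcs.data(), d_dsts.data(),
                             d_sizes.data(), static_cast<std::uint32_t>(n),
                             stream.value());
  return packed;
}
```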
**Is your feature request related to a problem? Please describe.**
It would be useful to have a `pack` function to merge multiple `device_buffer`s into a single `device_buffer`. This is helpful in situations where having one large `device_buffer` to read from is more performant, but the data ultimately consists of many smaller segments that need to be merged together. Example use cases include sending data with UCX and spilling data from device to host.

Similarly, it would be useful to have an `unpack` function to split a `device_buffer` into multiple `device_buffer`s. This is helpful in situations where having one large `device_buffer` to write into is more performant, but the data ultimately consists of many smaller segments that may need to be freed at different times. Example use cases include receiving data with UCX and unspilling data from host to device.

**Describe the solution you'd like**
For `pack`, it would be nice if it simply took several `device_buffer`s in `vector` form and returned a single one. Additionally, it would be nice if `pack` could recognize when the `device_buffer`s are already contiguous in memory and avoid a copy, though admittedly this last part is tricky (maybe less so if `unpack` is used regularly?). If we allow `pack` to change the order of the segments (to benefit from contiguous memory, for example), we may want additional information about where the data segments live in the larger `device_buffer`.
For `unpack`, it would be nice if it took a single `device_buffer` and `size_t`s in `vector` form describing where to split, and returned a `vector` of multiple `device_buffer`s. Additionally, it would be nice if `unpack` did not perform any copies. Hopefully that is straightforward, but there may be things I'm not understanding.
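A truly copy-free `unpack` may not be expressible with `device_buffer` alone, since each `device_buffer` owns its allocation; splitting without copies would need some non-owning view type instead. For illustration, here is a copy-based sketch of the shape described above (`unpack_naive` is a hypothetical name; error checking omitted):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>
#include <vector>

// Split `packed` into segments of the given sizes. Each output buffer is a
// copy, because device_buffer cannot alias memory it does not own.
std::vector<rmm::device_buffer> unpack_naive(rmm::device_buffer const& packed,
                                             std::vector<std::size_t> const& sizes,
                                             rmm::cuda_stream_view stream)
{
  std::vector<rmm::device_buffer> out;
  out.reserve(sizes.size());
  std::size_t offset = 0;
  for (auto size : sizes) {
    // device_buffer's (source, size, stream) constructor performs a
    // device-to-device copy of `size` bytes from `source`.
    out.emplace_back(static_cast<char const*>(packed.data()) + offset, size, stream);
    offset += size;
  }
  return out;
}
```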
**Describe alternatives you've considered**

One might consider using variadics in C++ for the arguments. While nice at the C++ level, this seems tricky to use from the Cython and Python levels. Hence the suggestion to just use `vector`.

`pack` itself could be implemented by a user simply allocating a larger buffer and copying each piece over (as sketched below). It would be nice to avoid the extra allocation when possible, though (which may require knowledge that RMM has about the allocations).
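As a point of reference, a minimal sketch of that user-side workaround, pairing the packed buffer with the per-segment sizes as the metadata (`pack_naive` is a hypothetical name; error checking omitted):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <utility>
#include <vector>

// Copy several buffers into one allocation, one cudaMemcpyAsync per segment.
std::pair<rmm::device_buffer, std::vector<std::size_t>> pack_naive(
    std::vector<rmm::device_buffer> const& buffers, rmm::cuda_stream_view stream)
{
  std::vector<std::size_t> sizes;
  sizes.reserve(buffers.size());
  std::size_t total = 0;
  for (auto const& b : buffers) {
    sizes.push_back(b.size());
    total += b.size();
  }

  rmm::device_buffer packed{total, stream};
  std::size_t offset = 0;
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    cudaMemcpyAsync(static_cast<char*>(packed.data()) + offset,
                    buffers[i].data(), sizes[i],
                    cudaMemcpyDeviceToDevice, stream.value());
    offset += sizes[i];
  }
  return {std::move(packed), std::move(sizes)};
}
```

Feeding the returned sizes into an `unpack` like the one sketched above round-trips the original segments; the extra allocation and per-segment copy launches are exactly the costs a built-in `pack` (or the CUB batched copy mentioned earlier in the thread) would avoid or amortize.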
**Additional context**

Having `unpack` in particular would be helpful for aggregated receives. A natural extension of this would be to have `pack` for aggregated sends. All in all, this should allow transmitting a larger amount of data at once with UCX, benefiting from the larger message sizes UCX is more honed for. PR ( https://github.com/dask/distributed/pull/3453 ) provides a WIP implementation of aggregated receives for context.

Also, having `pack` would be useful when spilling several `device_buffer`s from device to host, as it would allow us to pack them into one `device_buffer` before transferring ( https://github.com/rapidsai/dask-cuda/issues/250 ). Having `unpack` would help us break up the allocation whenever the object is unspilled.

This need has also come up in downstream contexts ( https://github.com/rapidsai/cudf/issues/3793 ). Maybe they would benefit from an upstream solution as well?