jakirkham opened this issue 4 years ago
As to `pack`, maybe these CUDA Virtual Memory Management APIs (pointed out by Cory in a related context) would be useful?
This functionality is kind of outside the scope of RMM. The direction we'd like to go with RMM is really to just be as simple as possible by providing a set of resources, the tools to get/set the default resource, and containers like `device_buffer`/`device_vector` that work with resources. This kind of pack or concatenate functionality is really more in the wheelhouse of a consumer of RMM like cuDF.
In cuDF, this is just a `concatenate`: https://github.com/rapidsai/cudf/pull/4224
That imposes a lot of challenges: you have to construct your data in a way that allows using `concatenate` and then still be able to unpack it cleanly.
Maybe it's a new API that could live in cuDF, but really we're looking for some API that takes a `vector<device_buffer>` and returns a `device_buffer` along with some form of metadata to then be able to "unpack" it back into a `vector<device_buffer>`.
This is along similar lines to https://github.com/rapidsai/cudf/issues/3793, but is more generalized than `cudf::table_view`, as this could be reused in a lot of non-cuDF places.
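Concretely, the requested API might look something like the sketch below; the names, signatures, and sizes-as-metadata are illustrative assumptions rather than a settled design:

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical signatures: the "metadata" here is simply each segment's size
// in bytes, which is enough to split the packed buffer back apart later.
std::pair<rmm::device_buffer, std::vector<std::size_t>> pack(
    std::vector<rmm::device_buffer> const& buffers, rmm::cuda_stream_view stream);

std::vector<rmm::device_buffer> unpack(rmm::device_buffer const& packed,
                                       std::vector<std::size_t> const& sizes,
                                       rmm::cuda_stream_view stream);
```

Keeping the metadata as a plain `vector<size_t>` makes the round trip trivial: `unpack` only needs the packed buffer and the segment sizes to rebuild the original pieces.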
Seems like this issue should be moved to libcudf.
I think we had this in RMM as opposed to libcudf because we wanted a place more general purpose than libcudf. For example, Dask/Distributed would possibly be interested in using this for packing/unpacking buffers in communication, but cuDF is way too bulky of a dependency for them.
Sure, but RMM isn't a catch-all for stuff we don't want to put into libcudf. It muddies the single responsibility principle to start putting random kernels into an allocator and container library (which currently has no kernels).
Yup, agreed. This felt like it was in a gray area, somewhat related to memory management, so the issue was raised here. Happy to defer it to somewhere else, but cuDF is too large of a dependency, unfortunately.
I think `cupy.concatenate` would also work here. Not sure if that is too large of a dependency for this use case (or if we are looking for a C++ operation too).
https://github.com/NVIDIA/cub/pull/359 (which adds `cub::DeviceMemcpy::Batched`) would be the right way to do this now.
FYI, the PR ( https://github.com/NVIDIA/cub/pull/359 ) landed. Looks like this will be part of CUB 2.1.0.
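For reference, here is a minimal sketch of how that batched copy could be used to pack many buffers into one allocation with a single copy kernel. It assumes CUB 2.1.0+; `pack_batched` is a hypothetical helper name and error checking is omitted:

```cpp
#include <cub/device/device_memcpy.cuh>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/device_uvector.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <cstdint>
#include <vector>

// Pack `buffers` into one allocation using cub::DeviceMemcpy::Batched.
rmm::device_buffer pack_batched(std::vector<rmm::device_buffer> const& buffers,
                                rmm::cuda_stream_view stream)
{
  auto const n = buffers.size();

  // Gather source pointers and sizes, and compute destination offsets.
  std::vector<void*> h_srcs(n);
  std::vector<void*> h_dsts(n);
  std::vector<std::size_t> h_sizes(n);
  std::size_t total = 0;
  for (std::size_t i = 0; i < n; ++i) {
    h_srcs[i]  = const_cast<void*>(buffers[i].data());
    h_sizes[i] = buffers[i].size();
    total += h_sizes[i];
  }

  rmm::device_buffer packed{total, stream};
  std::size_t offset = 0;
  for (std::size_t i = 0; i < n; ++i) {
    h_dsts[i] = static_cast<char*>(packed.data()) + offset;
    offset += h_sizes[i];
  }

  // The pointer/size arrays must be readable from the device.
  rmm::device_uvector<void*> d_srcs(n, stream);
  rmm::device_uvector<void*> d_dsts(n, stream);
  rmm::device_uvector<std::size_t> d_sizes(n, stream);
  cudaMemcpyAsync(d_srcs.data(), h_srcs.data(), n * sizeof(void*),
                  cudaMemcpyHostToDevice, stream.value());
  cudaMemcpyAsync(d_dsts.data(), h_dsts.data(), n * sizeof(void*),
                  cudaMemcpyHostToDevice, stream.value());
  cudaMemcpyAsync(d_sizes.data(), h_sizes.data(), n * sizeof(std::size_t),
                  cudaMemcpyHostToDevice, stream.value());

  // Standard two-phase CUB call: size the temporary storage, then run.
  std::size_t temp_bytes = 0;
  cub::DeviceMemcpy::Batched(nullptr, temp_bytes, d_srcs.data(), d_dsts.data(),
                             d_sizes.data(), static_cast<std::uint32_t>(n),
                             stream.value());
  rmm::device_buffer temp{temp_bytes, stream};
  cub::DeviceMemcpy::Batched(temp.data(), temp_bytes, d_srcs.data(), d_dsts.data(),
                             d_sizes.data(), static_cast<std::uint32_t>(n),
                             stream.value());
  return packed;
}
```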
**Is your feature request related to a problem? Please describe.**
It would be useful to have a `pack` function to merge multiple `device_buffer`s into a single `device_buffer`. This is helpful in situations where having one large `device_buffer` to read from is more performant, but the data ultimately consists of many smaller segments that need to be merged together. Example use cases include sending data with UCX and spilling data from device to host.

Similarly, it would be useful to have an `unpack` function to split a `device_buffer` into multiple `device_buffer`s. This is helpful in situations where having one large `device_buffer` to write into is more performant, but the data ultimately consists of many smaller segments that may need to be freed at different times. Example use cases include receiving data with UCX and unspilling data from host to device.

**Describe the solution you'd like**
For `pack`, it would be nice if it simply took several `device_buffer`s in `vector` form and returned a single one. Additionally, it would be nice if `pack` could recognize when the `device_buffer`s are already contiguous in memory and avoid a copy, though admittedly this last part is tricky (maybe less so if `unpack` is used regularly?). If we allow `pack` to change the order of the segments (to benefit from contiguous memory, for example), we may want additional information about where the data segments live in the larger `device_buffer`.
For `unpack`, it would be nice if it took a single `device_buffer` and `size_t`s in `vector` form describing where to split, and returned a `vector` of multiple `device_buffer`s. Additionally, it would be nice if `unpack` did not perform any copies. Hopefully that is straightforward, but there may be things I'm not understanding.
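A truly copy-free `unpack` may not be expressible with `device_buffer` alone, since each `device_buffer` owns its allocation; splitting without copies would need some non-owning view type instead. For illustration, here is a copy-based sketch of the shape described above (`unpack_naive` is a hypothetical name; error checking omitted):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cstddef>
#include <vector>

// Split `packed` into segments of the given sizes. Each output buffer is a
// copy, because device_buffer cannot alias memory it does not own.
std::vector<rmm::device_buffer> unpack_naive(rmm::device_buffer const& packed,
                                             std::vector<std::size_t> const& sizes,
                                             rmm::cuda_stream_view stream)
{
  std::vector<rmm::device_buffer> out;
  out.reserve(sizes.size());
  std::size_t offset = 0;
  for (auto size : sizes) {
    // device_buffer's (source, size, stream) constructor performs a
    // device-to-device copy of `size` bytes from `source`.
    out.emplace_back(static_cast<char const*>(packed.data()) + offset, size, stream);
    offset += size;
  }
  return out;
}
```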
**Describe alternatives you've considered**

One might consider using variadics in C++ for the arguments. While nice at the C++ level, this seems tricky to use from the Cython and Python levels. Hence the suggestion to just use `vector`.

`pack` itself could be implemented by a user simply allocating a larger buffer and copying each piece over (as sketched below). It would be nice to avoid the extra allocation when possible, though (which may require knowledge that RMM has about the allocations).
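As a point of reference, a minimal sketch of that user-side workaround, pairing the packed buffer with the per-segment sizes as the metadata (`pack_naive` is a hypothetical name; error checking omitted):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <utility>
#include <vector>

// Copy several buffers into one allocation, one cudaMemcpyAsync per segment.
std::pair<rmm::device_buffer, std::vector<std::size_t>> pack_naive(
    std::vector<rmm::device_buffer> const& buffers, rmm::cuda_stream_view stream)
{
  std::vector<std::size_t> sizes;
  sizes.reserve(buffers.size());
  std::size_t total = 0;
  for (auto const& b : buffers) {
    sizes.push_back(b.size());
    total += b.size();
  }

  rmm::device_buffer packed{total, stream};
  std::size_t offset = 0;
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    cudaMemcpyAsync(static_cast<char*>(packed.data()) + offset,
                    buffers[i].data(), sizes[i],
                    cudaMemcpyDeviceToDevice, stream.value());
    offset += sizes[i];
  }
  return {std::move(packed), std::move(sizes)};
}
```

Feeding the returned sizes into an `unpack` like the one sketched above round-trips the original segments; the extra allocation and per-segment copy launches are exactly the costs a built-in `pack` (or the CUB batched copy mentioned earlier in the thread) would avoid or amortize.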
**Additional context**

Having `unpack` in particular would be helpful for aggregated receives. A natural extension of this would be to have `pack` for aggregated sends. All in all, this should allow transmitting a larger amount of data at once with UCX, benefiting from the larger message sizes UCX is more honed for. PR ( https://github.com/dask/distributed/pull/3453 ) provides a WIP implementation of aggregated receives for context.

Also, having `pack` would be useful when spilling several `device_buffer`s from device to host, as it would allow us to pack them into one `device_buffer` before transferring ( https://github.com/rapidsai/dask-cuda/issues/250 ). Having `unpack` would help us break up the allocation whenever the object is unspilled.

This need has also come up in downstream contexts ( https://github.com/rapidsai/cudf/issues/3793 ). Maybe they would benefit from an upstream solution as well?