rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[QST] Is there a way to re-construct cudf DataFrame by __cuda_array_interface__ #11462

Closed wbo4958 closed 2 years ago

wbo4958 commented 2 years ago

Hi there,

Is there a way to reconstruct a cuDF DataFrame or Series from a __cuda_array_interface__?

The __cuda_array_interface__ looks like this:

{'shape': [5], 'data': ['xxxxxx', False], 'typestr': '<i4', 'version': 1}
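
A minimal sketch of how a dictionary like the one above could be turned back into something array libraries can consume, assuming it arrived as a JSON message. The names `normalize_interface`, `nbytes`, and `CudaArrayView` are hypothetical, and the pointer value is a made-up placeholder, not a real device address. The sketch only restores the tuple types the interface expects and wraps the result in an object that exposes the attribute; it does not touch the GPU.

```python
import json
from math import prod


class CudaArrayView:
    """Hypothetical holder that exposes __cuda_array_interface__."""

    def __init__(self, interface: dict):
        self.__cuda_array_interface__ = interface


def normalize_interface(message: dict) -> dict:
    """Convert a JSON-decoded message into tuple-based interface fields."""
    ptr, read_only = message["data"]
    return {
        "shape": tuple(int(n) for n in message["shape"]),
        "typestr": str(message["typestr"]),
        "data": (int(ptr), bool(read_only)),  # (device pointer, read-only flag)
        "version": int(message["version"]),
    }


def nbytes(interface: dict) -> int:
    """Buffer size implied by shape and typestr, e.g. '<i4' is 4 bytes per item."""
    itemsize = int(interface["typestr"][2:])
    return prod(interface["shape"]) * itemsize


# Placeholder pointer value for illustration only.
raw = '{"shape": [5], "data": [140000000000000, false], "typestr": "<i4", "version": 1}'
iface = normalize_interface(json.loads(raw))
view = CudaArrayView(iface)
```

An object like `view` is what consumers of the interface (for example `cupy.asarray`) look for, but the pointer is only meaningful inside the process that owns the allocation. For the cross-process case discussed in this thread, the pointer would first have to be obtained by opening a CUDA IPC handle in the receiving process.
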
trivialfis commented 2 years ago

Hi, to provide more context, @wbo4958 is currently working on https://github.com/NVIDIA/spark-rapids/issues/5561, which aims to share cuDF columns between a JVM process and a Python process via CUDA IPC. @wbo4958 generated the CUDA array interface on the JVM side as a reference to the underlying data and passed it to the Python process as a message. Other forms of reference are also possible. More broadly, this feature request is about reconstructing a DataFrame from a set of pointers/IPC handles along with the needed metadata.
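
To make "a set of pointers/IPC handles along with needed metadata" concrete, here is a rough sketch of what one per-column message could look like. Everything in it is an assumption for illustration: the `ColumnMessage` class, its field names, and the JSON/base64 encoding are hypothetical and are not part of cuDF or spark-rapids. The only external fact relied on is that a CUDA IPC memory handle (`cudaIpcMemHandle_t`) is a 64-byte opaque blob.

```python
import base64
import json
from dataclasses import asdict, dataclass

CUDA_IPC_HANDLE_SIZE = 64  # bytes in a cudaIpcMemHandle_t


@dataclass(frozen=True)
class ColumnMessage:
    """Hypothetical description of one column shared across processes."""

    name: str          # column name
    typestr: str       # element type, e.g. '<i4', as in the array interface
    length: int        # number of elements
    ipc_handle: bytes  # opaque handle exported by the owning process
    offset: int = 0    # byte offset of the column inside the allocation

    def __post_init__(self):
        if len(self.ipc_handle) != CUDA_IPC_HANDLE_SIZE:
            raise ValueError("ipc_handle must be exactly 64 bytes")

    def to_json(self) -> str:
        payload = asdict(self)
        # Raw bytes are not JSON-serializable, so encode the handle as base64.
        payload["ipc_handle"] = base64.b64encode(self.ipc_handle).decode("ascii")
        return json.dumps(payload)

    @classmethod
    def from_json(cls, text: str) -> "ColumnMessage":
        payload = json.loads(text)
        payload["ipc_handle"] = base64.b64decode(payload["ipc_handle"])
        return cls(**payload)


# Zero-filled handle used as a stand-in; a real one comes from the CUDA runtime.
msg = ColumnMessage(name="a", typestr="<i4", length=5, ipc_handle=bytes(64))
roundtrip = ColumnMessage.from_json(msg.to_json())
```

On the receiving side, the handle would be opened with the CUDA runtime (`cudaIpcOpenMemHandle`) to obtain a device pointer valid in that process, and the remaining fields would describe how to interpret the memory. A real design would also need to cover null masks, nested types, and lifetime management, none of which this sketch attempts.
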

I looked into the from_dlpack constructor, which should take ownership of the data and is similar to what we need for constructing a cuDF DataFrame from the handle. If this feature is desired, I can help work on a PR.

wence- commented 2 years ago

> I looked into the from_dlpack constructor, which should take ownership of the data and is similar to what we need for constructing cudf dataframe from the handle. If this feature is desired I can help work on a PR.

Right now, from_dlpack copies the data, which is not what you want. It could be updated to share the pointer instead, but that is a little fiddly.

shwina commented 2 years ago

Note that cuDF does support the __dataframe__ interchange protocol, although the implementation is currently somewhat buggy. That protocol is perhaps what you need here, rather than __cuda_array_interface__. I aim to resolve many of the outstanding issues with our implementation of the protocol during this release (22.10).

trivialfis commented 2 years ago

Thank you for the suggestions! The __dataframe__ protocol seems to be a good starting point for us to hack into cuDF.
