rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[FEA] Add public interop functions between pylibcudf and cudf classic #17191

Open vyasr opened 1 month ago

vyasr commented 1 month ago

Is your feature request related to a problem? Please describe. cuDF Python has historically focused on providing a pandas-like API. pylibcudf offers low-overhead access to all of libcudf's functionality, including many functions that do not fit cleanly into any pandas-like front end. Moreover, with the introduction of cudf-polars, we are likely to add functionality to libcudf to support polars that the pandas front end does not need. At present there are internal APIs for converting between cudf and pylibcudf objects, but no corresponding public functions. In the near term, we expect most pylibcudf users to be cudf users who want specific pieces of pylibcudf functionality rather than users looking to switch over to pylibcudf entirely. To serve this potential user base, we need to make converting between cudf and pylibcudf as seamless as possible.

Describe the solution you'd like To facilitate these users, we should implement to_pylibcudf and from_pylibcudf functions in cudf Python's Series and DataFrame classes. These functions should return the equivalent objects on the other side of the conversion, preserving the original types as faithfully as possible.
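As a rough illustration of the round trip this proposal asks for, here is a minimal sketch in plain Python. PlcColumn and Series below are stand-ins written for this example, not the real pylibcudf.Column or cudf.Series, and returning a metadata dict alongside the column is an assumption about one possible signature rather than a settled design.

```python
from dataclasses import dataclass
from typing import Hashable, Optional


@dataclass(frozen=True)
class PlcColumn:
    """Stand-in for pylibcudf.Column: typed data with no name attached."""
    dtype: str
    values: tuple


@dataclass
class Series:
    """Stand-in for cudf.Series: a column plus pandas-style metadata."""
    column: PlcColumn
    name: Optional[Hashable] = None

    def to_pylibcudf(self):
        # Hypothetical signature: return the low-level column together with
        # the metadata that a bare column cannot carry.
        return self.column, {"name": self.name}

    @classmethod
    def from_pylibcudf(cls, column, metadata=None):
        # Missing metadata falls back to an unnamed Series.
        metadata = metadata or {}
        return cls(column=column, name=metadata.get("name"))


s = Series(PlcColumn("int64", (1, 2, 3)), name="a")
col, meta = s.to_pylibcudf()
roundtrip = Series.from_pylibcudf(col, meta)
```

In this sketch the round trip is lossless only when the metadata travels with the column; calling from_pylibcudf with the column alone yields an unnamed Series.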

Additional context cudf's Cython layer is being rewritten around pylibcudf and will soon be phased out. Over the course of the next few releases we will be rewriting the Column layer to use pylibcudf directly. It does not make sense to wait for that work to finish before making pylibcudf more publicly usable, but we should make sure that any API decisions we make now do not hamstring future refactorings. One important factor to consider is whether a DataFrame will actually map to a pylibcudf.Table or whether it will instead maintain a list of pylibcudf.Column objects. We need to maintain a name->column index mapping somewhere in cudf, and that may affect this choice. We will also need to account for the fact that cudf Columns may or may not have a one-to-one mapping to pylibcudf columns. Of particular note are categorical columns, which in the ordered case may actually have a one-to-many mapping (really 1:2), and in the unordered case may finally end up mapping to the underlying dictionary types in libcudf (which are largely unused at this stage).
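To make the mapping concern concrete, here is a small sketch of why a name->column index mapping cannot assume one low-level column per name. All classes are stand-ins invented for this example, and representing an ordered categorical as a codes column plus a categories column is an assumption about what the 1:2 case could look like, not a description of how cudf does or will implement it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlcColumn:
    """Stand-in for pylibcudf.Column."""
    dtype: str
    values: tuple


@dataclass(frozen=True)
class CategoricalColumn:
    """Stand-in for a cudf categorical column: integer codes plus categories."""
    codes: PlcColumn
    categories: PlcColumn
    ordered: bool

    def to_pylibcudf(self):
        # One frame-level column becomes two low-level columns (the 1:2 case).
        return [self.codes, self.categories]


def flatten(columns):
    """Flatten named columns into a flat list of low-level columns.

    Also returns, for each name, the positions it occupies in that list.
    """
    flat, name_to_positions = [], {}
    for name, col in columns.items():
        parts = col.to_pylibcudf() if isinstance(col, CategoricalColumn) else [col]
        name_to_positions[name] = list(range(len(flat), len(flat) + len(parts)))
        flat.extend(parts)
    return flat, name_to_positions


frame = {
    "x": PlcColumn("int64", (10, 20, 30)),
    "size": CategoricalColumn(
        codes=PlcColumn("int8", (0, 1, 0)),
        categories=PlcColumn("string", ("S", "L")),
        ordered=True,
    ),
}
flat, positions = flatten(frame)
```

Here "x" occupies one position and "size" occupies two, so whichever side owns the name mapping has to store a list of positions per name rather than a single index.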

bdice commented 1 day ago

For Series, maybe we just need to expose Column.from_pylibcudf and Column.to_pylibcudf publicly? There are some notes in the code linked below about adding a copy parameter, though I don't know what they mean by "mark the underlying buffers as exposed."

https://github.com/rapidsai/cudf/blob/e7022fbc22eda538783e67f32d35ea8ea0798be8/python/cudf/cudf/_lib/column.pyx#L603

https://github.com/rapidsai/cudf/blob/e7022fbc22eda538783e67f32d35ea8ea0798be8/python/cudf/cudf/_lib/column.pyx#L459-L465

For DataFrame, I think the right answer is to convert to pylibcudf.Table and drop the column names. Perhaps it could return a tuple of a table and some kind of metadata object, or have a separate API for extracting the metadata needed to reconstruct the same cuDF DataFrame (this is what Arrow does with its separate schema objects).
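A minimal sketch of the "table plus metadata" option might look like the following. PlcTable and FrameMetadata are hypothetical stand-ins written for this example (a plain dict plays the role of the DataFrame), and the function names simply mirror the ones proposed in the issue; none of this is an existing cudf or pylibcudf API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlcTable:
    """Stand-in for pylibcudf.Table: an ordered collection of unnamed columns."""
    columns: tuple


@dataclass(frozen=True)
class FrameMetadata:
    """Hypothetical metadata object, playing the role of an Arrow schema."""
    column_names: tuple


def to_pylibcudf(frame):
    """Split a name->column dict into an unnamed table and its metadata."""
    return PlcTable(tuple(frame.values())), FrameMetadata(tuple(frame.keys()))


def from_pylibcudf(table, metadata):
    """Rebuild the name->column dict from a table and matching metadata."""
    if len(table.columns) != len(metadata.column_names):
        raise ValueError("metadata does not match the number of table columns")
    return dict(zip(metadata.column_names, table.columns))


frame = {"a": (1, 2, 3), "b": ("x", "y", "z")}
table, meta = to_pylibcudf(frame)
restored = from_pylibcudf(table, meta)
```

Keeping the metadata as a separate object means a caller who only wants the table can discard it, while a caller who wants to get back to the same DataFrame passes both halves to from_pylibcudf.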