Closed pvnick closed 1 week ago
Hi @pvnick, thanks for the report. We'll investigate and follow up on this issue.
For context, the decision to disallow iteration over GPU objects is intentional -- it keeps users from accidentally triggering many host-device transfers (e.g. in a for loop) that are highly inefficient. This is problematic in some cases when column names are part of an object on the GPU that needs to be iterated over. The solution to this will likely require some code change in dask-cudf to convert the StringIndex into a type that is supported on the host.
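To illustrate the kind of conversion described above, here is a minimal sketch that uses a mock object rather than real cuDF or dask-cudf code. `MockDeviceIndex`, `to_host`, and `host_column_names` are hypothetical names introduced only for this example; `to_host` stands in for a `to_pandas()`-style bulk conversion, and the `TypeError` is an assumption about how such an iteration guard might look.

```python
class MockDeviceIndex:
    """Hypothetical stand-in for a GPU-backed index; not a cuDF class."""

    def __init__(self, values):
        self._values = list(values)  # pretend this lives in device memory
        self.copies = 0              # number of device-to-host copies made

    def __iter__(self):
        # Mirrors the intentional guard: iteration is refused outright.
        raise TypeError("iteration over device objects is disabled")

    def to_host(self):
        # One bulk device-to-host copy (stand-in for a to_pandas()-style call).
        self.copies += 1
        return list(self._values)


def host_column_names(index):
    """Return column names as a host-side list, copying at most once."""
    to_host = getattr(index, "to_host", None)
    if callable(to_host):
        return to_host()
    # Plain host containers (lists, tuples, ...) can be iterated directly.
    return list(index)


cols = MockDeviceIndex(["a", "b", "c"])
names = host_column_names(cols)  # safe to loop over from here on
```

The caller converts once and then iterates over an ordinary host list, so the guard on the device object never has to be relaxed.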
While it is inefficient to iterate row-wise over the dataframe, it's pretty difficult to adapt all of dask-dataframe to do something different based on cudf/pandas. Note we can't really do this in dask-cudf without monkey-patching and/or reimplementing dask.dataframe.pivot_table.
I'm not sure the iteration is that inefficient if we implement it as follows (for a StringIndex):
    def __iter__(self):
        # Copy the whole index to the host once, then iterate there.
        return iter(self.to_pandas())
There's only one device-to-host copy.
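The difference between a single bulk copy and per-element transfers can be sketched with a mock class that counts copies. This is not cuDF code: `MockStringIndex`, `to_host` (standing in for `to_pandas()`), and `element_at` are hypothetical names used only to make the counting visible.

```python
class MockStringIndex:
    """Hypothetical stand-in for a device-backed string index."""

    def __init__(self, values):
        self._values = tuple(values)  # pretend this lives in device memory
        self.host_copies = 0          # device-to-host transfers so far

    def to_host(self):
        # Bulk transfer: the whole index moves to the host in one copy.
        self.host_copies += 1
        return list(self._values)

    def element_at(self, i):
        # Element access: every call costs its own transfer.
        self.host_copies += 1
        return self._values[i]

    def __iter__(self):
        # Copy once, then iterate over the host-side copy.
        return iter(self.to_host())


bulk_index = MockStringIndex(["x", "y", "z"])
bulk_names = [name for name in bulk_index]
bulk_copies = bulk_index.host_copies

slow_index = MockStringIndex(["x", "y", "z"])
slow_names = [slow_index.element_at(i) for i in range(3)]
slow_copies = slow_index.host_copies
```

Both loops produce the same names, but in this sketch the `__iter__` path performs one transfer regardless of length, while element-by-element access performs one per item.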
I am leaning towards the same view as Lawrence here. We've had these disabled code paths for a long time, and while I understand the rationale, I think at this point I'm OK with relaxing this behavior. Especially in light of cudf.pandas or dask integration, disabling a code path in a way that breaks those workflows seems less favorable than it may once have.
I’m okay with that proposal. My comments above were primarily to establish historical context — I am alright with changing the behavior to solve compatibility issues.
Describe the bug
pivot_table fails on a dask_cudf dataframe due to an unimplemented Index iteration function:
Steps/Code to reproduce bug
Error:
Expected behavior
pivot_table succeeds as documented.
Environment overview (please complete the following information)
Installed cuDF using pip, using the stable release:
Environment details