[FEA] df.explode with multiple columns

rapidsai / cudf

cuDF - GPU DataFrame Library

https://docs.rapids.ai/api/cudf/stable/

Apache License 2.0


[FEA] df.explode with multiple columns #10271

Open ayushdg opened 2 years ago

ayushdg commented 2 years ago

Is your feature request related to a problem? Please describe.
Pandas supports exploding multiple columns with df.explode, but cudf currently fails for that case.

Describe the solution you'd like
Support explode with multiple column names. The pandas documentation contains an example of the expected output from a multi-column explode: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html#pandas-dataframe-explode
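For reference, a minimal sketch of the pandas behavior being requested. The frame and its column names (`id`, `a`, `b`) are made up for illustration, and passing a list of columns to `explode` assumes pandas 1.3 or newer.

```python
import pandas as pd

# Hypothetical data: "a" and "b" hold lists of matching length on every row.
df = pd.DataFrame(
    {
        "id": [10, 20, 30],
        "a": [[1, 2], [3], []],
        "b": [["x", "y"], ["z"], []],
    }
)

# Explode both list columns together; the other columns are repeated per element.
out = df.explode(["a", "b"])
print(out)
```

Rows whose lists are empty are kept as a single row with missing values, and the original index labels are repeated once per list element.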

Describe alternatives you've considered
I haven't been able to come up with a workaround for this, but two separate explodes plus a merge of some kind might produce the same result.


beckernick commented 2 years ago

I think this could be done with the workaround you suggest plus a bit of index "magic", because we know that, for any given row, the lists in all columns being exploded must have the same number of elements.

If you create a monotonically increasing id, you can merge on that key across your frames. Then, I think you can leverage the equal-elements-per-list requirement to do something like:

original_df = ...
merged_result = ...

# Don't drop rows whose list has 0 elements: count them as 1 repeat
index_repeats = original_df[col].list.len().replace(0, 1)
merged_result = merged_result.set_index(cudf.RangeIndex(start=0, stop=len(original_df)).repeat(index_repeats))

EDIT: This logic might get messy with more than a few columns, as you'd need to keep track of which frames get which columns
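One way to read the "separate explodes plus a merge" idea is sketched below, using pandas as a CPU stand-in for cudf. The helper `explode_with_keys`, the key names `row_id` and `pos`, and the sample frame are all hypothetical. The within-row position key `pos` is an assumption added here so the merge pairs elements one-to-one; it is not something spelled out in the comment above.

```python
import pandas as pd


def explode_with_keys(frame, col):
    """Explode one list column and tag each element with merge keys (hypothetical helper)."""
    # row_id is the original row label; pos is the element's position within its row.
    out = frame[col].explode().rename_axis("row_id").reset_index()
    out["pos"] = out.groupby("row_id").cumcount()
    return out


df = pd.DataFrame(
    {
        "id": [10, 20, 30],
        "a": [[1, 2], [3], []],
        "b": [["x", "y"], ["z"], []],
    }
)
cols = ["a", "b"]

# Explode each column on its own, then pair elements up by (row_id, pos).
left = explode_with_keys(df, "a")
right = explode_with_keys(df, "b")
merged = left.merge(right, on=["row_id", "pos"])

# Re-attach the columns that were not exploded, keyed by the original row label.
result = df.drop(columns=cols).join(merged.set_index("row_id")[cols])
print(result)
```

With more than two columns, each additional exploded frame would need its own merge on the same two keys, which is where the bookkeeping mentioned in the EDIT note starts to grow.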

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

wence- commented 1 year ago

Just to note, I think that explode in libcudf is order-preserving. If so, you could just do:

columns = ...
current, *columns = columns
result = df.drop(set(columns) - {current}, axis=1).explode(current)
required_len = df[current].list.len()
while columns:
    current, *columns = columns
    if not (required_len == df[current].list.len()).all():
        raise ValueError(...)
    result[current] = df[current].explode()
result = result.reindex(df.keys(), axis=1)
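A runnable sketch of the same column-by-column approach, written against pandas as a CPU stand-in since the loop above leaves `columns` and the error message elided. The helper name `explode_columns` and the sample frame are hypothetical, and the sketch assumes cudf's `explode` behaves analogously. Because the exploded result carries duplicate index labels, each extra column is assigned positionally via `to_numpy()`, which is only valid if explode preserves row order.

```python
import pandas as pd


def explode_columns(df, columns):
    """Explode several list columns one at a time (hypothetical helper)."""
    first, *rest = columns

    def lengths(col):
        # Mirror explode's row count: empty lists and scalars still yield one row.
        return df[col].map(lambda v: max(len(v), 1) if isinstance(v, list) else 1)

    required_len = lengths(first)
    # Explode the first column with the remaining list columns left out.
    result = df.drop(columns=rest).explode(first)
    for col in rest:
        if not (required_len == lengths(col)).all():
            raise ValueError(f"column {col!r} has mismatched list lengths")
        # Positional assignment: rows line up because explode preserves order.
        result[col] = df[col].explode().to_numpy()
    # Restore the original column order.
    return result.reindex(columns=df.columns)


df = pd.DataFrame(
    {
        "id": [10, 20, 30],
        "a": [[1, 2], [3], []],
        "b": [["x", "y"], ["z"], []],
    }
)
out = explode_columns(df, ["a", "b"])
print(out)
```

The length check runs before each assignment, so a frame whose lists differ in length on any row raises `ValueError` instead of silently misaligning values.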