Open ayushdg opened 2 years ago
I think this could be done with the workaround you suggest plus a bit of index "magic", because we know that, for any given row, every list in the columns being exploded must have the same number of elements.
If you create a monotonically increasing id, you can merge on that key across your frames. Then, I think you can leverage the equal-elements-per-list requirement to do something like:
original_df = ...
merged_result = ...
# Repeat count per original row; don't drop rows with 0 elements, so count them as 1
index_repeats = original_df[col].list.len().replace(0, 1)
merged_result = merged_result.set_index(cudf.RangeIndex(start=0, stop=len(original_df)).repeat(index_repeats))
EDIT: This logic might get messy with more than a few columns, as you'd need to keep track of which frames get which columns.
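To make the repeat-count idea concrete, here is a minimal plain-Python sketch (no cudf involved) of how the original row id for each exploded row could be derived from the per-row list lengths. The helper name `repeated_row_ids` is hypothetical, and treating an empty list as one output row mirrors the `.replace(0, 1)` step in the snippet above.

```python
def repeated_row_ids(list_lengths):
    """Return the original row id for each row of the exploded result."""
    row_ids = []
    for row_id, length in enumerate(list_lengths):
        # An empty list still produces one output row, so treat 0 as 1.
        row_ids.extend([row_id] * max(length, 1))
    return row_ids


lengths = [3, 0, 2]  # per-row list lengths of one exploded column
print(repeated_row_ids(lengths))  # [0, 0, 0, 1, 2, 2]
```

Because every exploded column is required to have the same per-row lengths, the same ids would apply to each of the separately exploded frames.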
Just to note, I think that explode in libcudf is order-preserving. If so, you could just do:
columns = ...
current, *columns = columns
# Explode the first column, leaving the remaining list columns out for now
result = df.drop(columns, axis=1).explode(current)
required_len = df[current].list.len()
while columns:
    current, *columns = columns
    # Each column must match the first column's per-row list lengths
    if not (required_len == df[current].list.len()).all():
        raise ValueError(...)
    result[current] = df[current].explode()
# Restore the original column order
result = result.reindex(df.keys(), axis=1)
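The loop above can be modeled in plain Python to show why order preservation matters: if every column explodes to rows in the same order, the exploded columns can simply be placed side by side without a merge. This is an illustrative sketch only. `explode_columns` and the dict-of-lists table are assumptions made for the example, not cudf API, and an empty list becomes a single `None` row, analogous to the missing value pandas produces for empty lists.

```python
def explode_columns(table, columns):
    """Explode several list columns of a dict-of-lists table together."""
    first = columns[0]
    # Per-row list lengths of the first column; all others must match.
    lengths = [len(cell) for cell in table[first]]
    for name in columns[1:]:
        if [len(cell) for cell in table[name]] != lengths:
            raise ValueError(f"column {name!r} has mismatched list lengths")
    result = {}
    for name, cells in table.items():
        out = []
        for cell, n in zip(cells, lengths):
            if name in columns:
                # Row order is preserved, so exploded columns stay aligned.
                out.extend(cell if cell else [None])
            else:
                # Non-exploded columns repeat their value for each new row.
                out.extend([cell] * max(n, 1))
        result[name] = out
    return result


table = {"a": [[1, 2], [], [3]], "b": ["x", "y", "z"], "c": [[10, 20], [], [30]]}
print(explode_columns(table, ["a", "c"]))
# {'a': [1, 2, None, 3], 'b': ['x', 'x', 'y', 'z'], 'c': [10, 20, None, 30]}
```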
Is your feature request related to a problem? Please describe.
Pandas supports exploding multiple columns with df.explode, but cudf currently fails for that case.
Describe the solution you'd like
Support explode with multiple column names. The pandas documentation contains an example of the expected output from a multi-column explode: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html#pandas-dataframe-explode
Describe alternatives you've considered
I haven't been able to come up with a workaround for this, but it might take two separate explodes plus a merge of some kind to get to the same result.