pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.62k stars 17.57k forks

ENH: Add Option to Include Array Offset as MultiIndex Level in `explode()` #59163

Open chelsea-lin opened 3 days ago

chelsea-lin commented 3 days ago

Feature Type

Problem Description

Currently, df.explode() and s.explode() flatten lists/arrays within Series/DataFrames. However, information about the original position of each element within its list is lost. This makes it difficult to:

Proposed Solution: Introduce a new parameter, offset, to both df.explode() and s.explode().

Example Usage:

>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])
>>> s
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object
>>> s.explode() # <- Current behavior:
0         1
0         2
0         3
1       foo
2       NaN
3         3
3         4
dtype: object

>>> s.explode(offset=True) # <- With proposed feature
0  1         1
   2         2
   3         3
1  1       foo
2  1       NaN
3  1         3
   2         4
dtype: object

Feature Description

Introduce a new parameter, offset, to both df.explode() and s.explode().

def explode(self, ..., offset: bool = False):  # Default to False for backward compatibility
    """
    Parameters:
        ...
        offset: If True, include the original array offset as a level in the resulting MultiIndex.
    """

Alternative Solutions

While it's technically possible to infer the offset in some cases, it requires additional steps and assumptions about the data. The offset parameter provides a direct, intuitive solution.
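
For reference, a minimal sketch of the kind of inference described above, using only existing pandas calls. The helper name explode_with_offset is hypothetical and not part of pandas. It assumes the original index has unique labels, because rows that share a label would be merged into one group and numbered together. The offsets it produces are 0-based, whereas the example output above counts from 1.

```python
import pandas as pd


def explode_with_offset(s: pd.Series) -> pd.Series:
    """Hypothetical helper: explode `s` and append a 0-based offset level.

    Assumes the labels in `s.index` are unique.
    """
    exploded = s.explode()
    # Position of each element within its original row.
    offset = exploded.groupby(level=0).cumcount()
    exploded.index = pd.MultiIndex.from_arrays(
        [exploded.index, offset.to_numpy()],
        names=[s.index.name, "offset"],
    )
    return exploded


s = pd.Series([[1, 2, 3], "foo", [], [3, 4]])
result = explode_with_offset(s)
```

The extra groupby pass and the unique-index assumption are the additional steps that a built-in offset parameter would avoid.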

Additional Context

No response

ritwizsinha commented 2 days ago

take

ritwizsinha commented 2 days ago

@chelsea-lin With the small utility below, isn't it always possible to get the column offsets?

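import pandas as pd

def get_col_and_row_offsets(s):
    # `s` is a Series of list-likes; to_frame() below is a Series method.
    exploded_df = s.explode(ignore_index=False).to_frame(name='col1')
    # Position of each element within its original row (0-based).
    exploded_df['col_offset'] = exploded_df.groupby(level=0).cumcount()
    # Label of the row each element came from.
    exploded_df['row_offset'] = exploded_df.index

    return exploded_df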

This gives the result shown below:

  col1  col_offset  row_offset
0    1           0           0
0    2           1           0
0    3           2           0
1  foo           0           1
2  NaN           0           2
3    3           0           3
3    4           1           3

Am I missing some edge case here?

chelsea-lin commented 2 days ago

@ritwizsinha Thanks for tackling this! You've got it right. The dataframe/series relies on its index and offset (implicitly) for ordering. The get_col_and_row_offsets function is the alternative solution. However, it could be expensive with larger datasets. That's why I'm curious whether explode could provide the offset directly, potentially with better performance.

ritwizsinha commented 2 days ago

If we need to show the column offsets of every item in the Series/DataFrame, the best case is linear time, O(N), where N is the number of items. I don't think we can do better than that if all offsets have to be produced. Getting the offset of a single element might be possible in constant time, but I need to research that further.

chelsea-lin commented 1 day ago

I agree that the expected time complexity is likely O(N). Intuitively, the difference is that explode(offset=True) scans the data once, while get_col_and_row_offsets might require two scans. However, I'm not entirely familiar with pandas internals, so further investigation is needed.

ritwizsinha commented 1 day ago

Did some research. The explode function is defined here

There are several ways to add an offset list to the explode API:

  1. The Python explode calls reshape.explode, a Cython function that returns the exploded values and the number of items in each row. Computing the offsets in Cython and returning them as well would be more efficient, but it changes the function's return type, which is an intrusive change.
  2. The other option is to compute the column offsets in Python, inside explode, after the values and per-row item counts come back. This would be slower but less intrusive.
  3. A third option is to add a new Cython function that takes the row item count Series and creates an offset Series from it.
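
As a rough illustration of options 2 and 3, the offsets can be derived from the per-row counts alone. The sketch below is plain NumPy, not pandas internals. The function name offsets_from_counts is hypothetical. It assumes counts holds the number of output rows produced for each input row, with scalars and empty lists counted as 1, which is an assumption chosen to match the example output in this thread.

```python
import numpy as np


def offsets_from_counts(counts: np.ndarray) -> np.ndarray:
    """Hypothetical helper: build 0-based within-row offsets from per-row counts."""
    counts = np.asarray(counts, dtype=np.int64)
    # Flat position at which each input row starts in the exploded output.
    starts = np.cumsum(counts) - counts
    # Subtract each row's start from the flat positions of its elements.
    return np.arange(counts.sum()) - np.repeat(starts, counts)


# Counts for the example Series [[1, 2, 3], 'foo', [], [3, 4]].
offsets = offsets_from_counts(np.array([3, 1, 1, 2]))
```

This stays O(N) in the number of exploded items and avoids a groupby, but it is still a second pass over the counts rather than something computed during the explode itself.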

Before benchmarking any of this, I think we first need to decide whether this feature should be supported at all.