Open chelsea-lin opened 3 days ago
take
@chelsea-lin With the small utility below, isn't it always possible to get the column offsets?
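import pandas as pd

def get_col_and_row_offsets(s):
    # s is a Series of list-likes; explode() keeps the original index labels
    exploded_df = s.explode(ignore_index=False).to_frame(name='col1')
    # position of each element within its original list
    exploded_df['col_offset'] = exploded_df.groupby(level=0).cumcount()
    # label of the row the element came from
    exploded_df['row_offset'] = exploded_df.index
    return exploded_df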
This gives the following result:
    col1  col_offset  row_offset
0      1           0           0
0      2           1           0
0      3           2           0
1    foo           0           1
2    NaN           0           2
3      3           0           3
3      4           1           3
Am I missing some edge case here?
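For anyone who wants to reproduce the table above, a Series like the one below gives the same rows. The input is an assumption inferred from the printed output, since the comment does not show it. The helper is the one from the comment, restated so the snippet runs on its own.

```python
import pandas as pd

def get_col_and_row_offsets(s):
    # Same logic as the utility in the comment above.
    exploded = s.explode(ignore_index=False).to_frame(name="col1")
    exploded["col_offset"] = exploded.groupby(level=0).cumcount()
    exploded["row_offset"] = exploded.index
    return exploded

# Assumed input: inferred from the output shown in the comment.
s = pd.Series([[1, 2, 3], "foo", [], [3, 4]])
result = get_col_and_row_offsets(s)
print(result)
```

Note that the scalar "foo" and the empty list both end up with col_offset 0, even though neither has a real list position.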
@ritwizsinha Thanks for tackling this!
You've got it right. The DataFrame/Series relies on its index and offset (implicitly) for ordering. The get_col_and_row_offsets function is an alternative solution. However, it could be expensive on larger datasets, which is why I'm curious whether explode could provide the offset directly, potentially with better performance.
If we need to show the column offsets of every item in the Series/DataFrame, the best case is linear time, O(N), where N is the number of items. I don't think we can do better than that if all offsets are needed. Getting the offset of a single element might be possible in constant time, but that needs more research.
I agree that the expected time complexity is likely O(N). Intuitively, the difference is that explode(offset=True) scans the data once, while get_col_and_row_offsets might require two scans. However, I'm not entirely familiar with pandas internals, so further investigation is needed.
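To make the one-scan versus two-scan intuition a bit more concrete, here is a rough sketch that derives the offsets from the list lengths alone, without a groupby. It is not how pandas implements explode, and offsets_from_lengths is a hypothetical helper name. It also assumes every entry is a non-empty list, which the groupby version does not require.

```python
import numpy as np
import pandas as pd

def offsets_from_lengths(s):
    # Hypothetical helper: assumes every entry of `s` is a non-empty list.
    lengths = s.map(len).to_numpy()
    # Start position of each list in the flattened output.
    starts = np.cumsum(lengths) - lengths
    # Global position minus the start of the owning list gives the offset.
    return np.arange(lengths.sum()) - np.repeat(starts, lengths)

s = pd.Series([[1, 2, 3], [4], [5, 6]])
exploded = s.explode().to_frame(name="col1")
exploded["col_offset"] = offsets_from_lengths(s)
print(exploded)
```

Each step here is linear in the number of elements, so this does not change the O(N) bound. Whether it is actually faster than the groupby version would need benchmarking.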
Did some research. The explode function is defined here
There are plenty of ways of adding an offset list to the explode API:
Before benchmarking any of this, I think we need to decide whether we want to support this at all.
Feature Type
[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas
Problem Description
Currently, df.explode() and s.explode() flatten lists/arrays within Series/DataFrames. However, information about the original position of each element within its list is lost. This makes it difficult to:
Proposed Solution: Introduce a new parameter, offset, to both df.explode() and s.explode().
Example Usage:
Feature Description
Introduce a new parameter, offset, to both df.explode() and s.explode().
Alternative Solutions
While it's technically possible to infer the offset in some cases, it requires additional steps and assumptions about the data. The offset parameter provides a direct, intuitive solution.
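As an illustration of one such assumption: the groupby-based workaround discussed in the comments relies on the index being unique. With duplicate index labels, elements from different lists land in the same group and the offsets keep counting across lists. The data below is made up for the example.

```python
import pandas as pd

# Two lists that share the same index label.
s = pd.Series([[1, 2], [3, 4]], index=["a", "a"])

# Grouping by the duplicated label merges both lists into one group.
exploded = s.explode()
naive = exploded.groupby(level=0).cumcount().tolist()

# Resetting to a positional index first restores one group per original row.
positional = s.reset_index(drop=True).explode()
fixed = positional.groupby(level=0).cumcount().tolist()

print(naive)  # offsets run on across the two lists
print(fixed)  # offsets restart for each list
```

The reset_index step is the kind of extra bookkeeping a built-in offset option would make unnecessary.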
Additional Context
No response