pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.71k stars 17.92k forks source link

API: should reindex on a level introduce NaNs for missing entries per label of other levels? #12319

Open jorisvandenbossche opened 8 years ago

jorisvandenbossche commented 8 years ago

Suppose the following dataframe and reindex operation:

In [65]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b'), ('B', 'a')]))

In [66]: df
Out[66]:
   A       B
   a   b   a
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

In [67]: df.reindex(columns=['a', 'b'], level=1)
Out[67]:
   A       B
   a   b   a
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

Should this give the following?

In [67]: df.reindex(columns=['a', 'b'], level=1)
Out[67]:
   A       B
   a   b   a    b
0  0   1   2  NaN
1  3   4   5  NaN
2  6   7   8  NaN
3  9  10  11  NaN

I am not sure what the exact behaviour of the level keyword should be, but eg in the following example it does the selection of columns for each of label of the other level:

In [69]: df2 = pd.DataFrame(np.arange(18).reshape(3,6), columns=pd.MultiIndex.from_product([('A', 'B'), ('a', 'b', 'c')]))
In [70]:

In [70]: df2
Out[70]:
    A           B
    a   b   c   a   b   c
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17

In [71]: df2.reindex(columns=['a', 'c'], level=1)
Out[71]:
    A       B
    a   c   a   c
0   0   2   3   5
1   6   8   9  11
2  12  14  15  17
jreback commented 8 years ago

xref #7895 (a lot of dupe as well), maybe close the other? dupe of #11226, but I like your example better xref #10549

jreback commented 8 years ago

If this is a leaf level then its easy to do this, but what if its not, then do you do a cartesian product of everything below that level? we could, but would want a broadcast argument I think (though this just maybe making it even more complicated)

jreback commented 7 years ago

xref #7895

andenter commented 2 years ago

Curious why the expected behavior should be different?

ybagdasa commented 2 years ago

I'm spent a few hours looking into the issue, but I am not familiar enough with index joins to know what the problem is. I think it might be happening in the function Index._join_levelfound in pandas/core/indexes/base.py.

Within the function new_level, left_lev_indexer, right_lev_indexer = old_level.join( right, how=how, return_indexers=True ) is called. The variablehow is a right join. Not sure whether the outputs are as expected here. new_level seems to be correct.

Finally near the end of the function join_index = MultiIndex( levels=new_levels, codes=new_codes, names=left.names, verify_integrity=False, ) is called. The new_codes appear to be the same as the old multiindex codes. I wonder if this is the issue.

Hope that helps someone willing to fix it.

FirefoxMetzger commented 1 year ago

I ran into this yesterday and it caught me by surprise. If you do a reindex without level it does add fill_value for the missing entries in the flat index, and I expected this behavior to carry over to the particular level I specify.

So, yes, I would have very much expected it to be a cartesian product between the new index (passed into reindex) and all unique entries of the old index after the selected level got removed. For example, if I have a dataframe with a 3-level multi index:

df = pd.DataFrame(
    np.arange(10).reshape(5, 2),
    index=pd.MultiIndex.from_tuples(
        [
            ("A", "a", 2),
            ("A", "b", 0),
            ("B", "a", 1),
            ("B", "a", 3),
            ("B", "d", 1),
        ]
    ),
)
|               |   0 |   1 |
|:--------------|----:|----:|
| ('A', 'a', 2) |   0 |   1 |
| ('A', 'b', 0) |   2 |   3 |
| ('B', 'a', 1) |   4 |   5 |
| ('B', 'a', 3) |   6 |   7 |
| ('B', 'd', 1) |   8 |   9 |

and I reindex the "middle" index using new_index = pd.Index(["a", "b", "c", "d"]) then I would expect df.reindex(new_index, level=1) to produce:

|               |   0 |   1 |
|:--------------|----:|----:|
| ('A', 'a', 2) |   0 |   1 |
| ('A', 'b', 2) | nan | nan |
| ('A', 'c', 2) | nan | nan |
| ('A', 'd', 2) | nan | nan |
| ('A', 'a', 0) | nan | nan |
| ('A', 'b', 0) |   2 |   3 |
| ('A', 'c', 0) | nan | nan |
| ('A', 'd', 0) | nan | nan |
| ('B', 'a', 1) |   4 |   5 |
| ('B', 'b', 1) | nan | nan |
| ('B', 'c', 1) | nan | nan |
| ('B', 'd', 1) |   8 |   9 |
| ('B', 'a', 3) |   6 |   7 |
| ('B', 'b', 3) | nan | nan |
| ('B', 'c', 3) | nan | nan |
| ('B', 'd', 3) | nan | nan |

Which is equivalent to this (naive, but hopefully easy to read) python implementation:

level = 1
remaining_levels = df.reset_index(level=level, drop=True).index.unique()
new_index_tuples = []
for old_levels in remaining_levels:
    for new_level in new_index:
        new_index_tuples.append((*old_levels[:level], new_level, *old_levels[level:]))
target_index = pd.MultiIndex.from_tuples(new_index_tuples)
df.reindex(target_index)

(see also: https://stackoverflow.com/questions/75106282/how-to-reindex-a-datetime-based-multiindex-in-pandas/75106323)