Open jorisvandenbossche opened 8 years ago
xref #7895 (a lot of dupe as well), maybe close the other? dupe of #11226, but I like your example better xref #10549
If this is a leaf level then its easy to do this, but what if its not, then do you do a cartesian product of everything below that level? we could, but would want a broadcast
argument I think (though this just maybe making it even more complicated)
xref #7895
Curious why the expected behavior should be different?
I'm spent a few hours looking into the issue, but I am not familiar enough with index joins to know what the problem is. I think it might be happening in the function Index._join_level
found in pandas/core/indexes/base.py.
Within the function new_level, left_lev_indexer, right_lev_indexer = old_level.join( right, how=how, return_indexers=True )
is called. The variablehow
is a right join. Not sure whether the outputs are as expected here. new_level seems to be correct.
Finally near the end of the function
join_index = MultiIndex( levels=new_levels, codes=new_codes, names=left.names, verify_integrity=False, )
is called. The new_codes appear to be the same as the old multiindex codes. I wonder if this is the issue.
Hope that helps someone willing to fix it.
I ran into this yesterday and it caught me by surprise. If you do a reindex
without level it does add fill_value
for the missing entries in the flat index, and I expected this behavior to carry over to the particular level I specify.
So, yes, I would have very much expected it to be a cartesian product between the new index (passed into reindex) and all unique entries of the old index after the selected level got removed. For example, if I have a dataframe with a 3-level multi index:
df = pd.DataFrame(
np.arange(10).reshape(5, 2),
index=pd.MultiIndex.from_tuples(
[
("A", "a", 2),
("A", "b", 0),
("B", "a", 1),
("B", "a", 3),
("B", "d", 1),
]
),
)
| | 0 | 1 |
|:--------------|----:|----:|
| ('A', 'a', 2) | 0 | 1 |
| ('A', 'b', 0) | 2 | 3 |
| ('B', 'a', 1) | 4 | 5 |
| ('B', 'a', 3) | 6 | 7 |
| ('B', 'd', 1) | 8 | 9 |
and I reindex the "middle" index using new_index = pd.Index(["a", "b", "c", "d"])
then I would expect df.reindex(new_index, level=1)
to produce:
| | 0 | 1 |
|:--------------|----:|----:|
| ('A', 'a', 2) | 0 | 1 |
| ('A', 'b', 2) | nan | nan |
| ('A', 'c', 2) | nan | nan |
| ('A', 'd', 2) | nan | nan |
| ('A', 'a', 0) | nan | nan |
| ('A', 'b', 0) | 2 | 3 |
| ('A', 'c', 0) | nan | nan |
| ('A', 'd', 0) | nan | nan |
| ('B', 'a', 1) | 4 | 5 |
| ('B', 'b', 1) | nan | nan |
| ('B', 'c', 1) | nan | nan |
| ('B', 'd', 1) | 8 | 9 |
| ('B', 'a', 3) | 6 | 7 |
| ('B', 'b', 3) | nan | nan |
| ('B', 'c', 3) | nan | nan |
| ('B', 'd', 3) | nan | nan |
Which is equivalent to this (naive, but hopefully easy to read) python implementation:
level = 1
remaining_levels = df.reset_index(level=level, drop=True).index.unique()
new_index_tuples = []
for old_levels in remaining_levels:
for new_level in new_index:
new_index_tuples.append((*old_levels[:level], new_level, *old_levels[level:]))
target_index = pd.MultiIndex.from_tuples(new_index_tuples)
df.reindex(target_index)
(see also: https://stackoverflow.com/questions/75106282/how-to-reindex-a-datetime-based-multiindex-in-pandas/75106323)
Suppose the following dataframe and reindex operation:
Should this give the following?
I am not sure what the exact behaviour of the
level
keyword should be, but eg in the following example it does the selection of columns for each of label of the other level: