Open LucasG0 opened 3 years ago
It's slightly subjective what should happen here.
See #39775 for many similar cases; this is possibly a duplicate.
I think this is a bit different than #39775, which is more related to what happens if there are missing keys.
Here, the issue is that if the MultiIndex contains duplicates, and you ask for a key that is not duplicated in the index, you get a Series. If there are no duplicate keys in the MultiIndex, you get a single value. I ran into this in some code I was writing, where I knew there might or might not be duplicates after some merging, and then the type of the result was inconsistent, depending on whether there were duplicate keys in the index.
Here's another example:
import pandas as pd
df = pd.DataFrame(
[["a", 1, 10], ["a", 1, 20], ["b", 2, 30]], columns=["ab", "ot", "val"]
).set_index(["ab", "ot"])
print(df)
s2 = df["val"].loc[("b", 2)]
print("result")
print(s2)
print()
print("s2 type", type(s2))
print()
df2 = pd.DataFrame(
[["a", 1, 10], ["c", 1, 20], ["b", 2, 30]], columns=["ab", "ot", "val"]
).set_index(["ab", "ot"])
print(df2)
s3 = df2["val"].loc[("b", 2)]
print("result")
print(s3)
print()
print("s3 type", type(s3))
Here's the output:
       val
ab ot
a  1    10
   1    20
b  2    30
result
ab  ot
b   2     30
Name: val, dtype: int64

s2 type <class 'pandas.core.series.Series'>

       val
ab ot
a  1    10
c  1    20
b  2    30
result
30

s3 type <class 'numpy.int64'>
In both cases, I am doing .loc using the key ("b", 2). In the first case, other indices are duplicated, and I get a Series as the result. In the second case, no indices are duplicated, and I get a single value as the result.
So now I have to check the type of the result to determine what computation to do next. This is very unfriendly!
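Until the return type is made consistent, one possible stopgap is to normalize the result yourself. The sketch below is only an illustration: loc_scalar is a hypothetical helper name, not a pandas API, and it assumes the key is expected to match exactly one row.

```python
import pandas as pd


def loc_scalar(series, key):
    """Return the single value at `key`, whether .loc yields a scalar
    or a one-row Series. Hypothetical helper, not part of pandas."""
    result = series.loc[key]
    if isinstance(result, pd.Series):
        # A Series came back: accept it only if it holds exactly one row.
        if len(result) != 1:
            raise ValueError(
                f"expected exactly one match for {key!r}, got {len(result)}"
            )
        return result.iloc[0]
    return result


# Same frames as in the example above: one with duplicated keys, one without.
dup = pd.DataFrame(
    [["a", 1, 10], ["a", 1, 20], ["b", 2, 30]], columns=["ab", "ot", "val"]
).set_index(["ab", "ot"])
uniq = pd.DataFrame(
    [["a", 1, 10], ["c", 1, 20], ["b", 2, 30]], columns=["ab", "ot", "val"]
).set_index(["ab", "ot"])

v_dup = loc_scalar(dup["val"], ("b", 2))
v_uniq = loc_scalar(uniq["val"], ("b", 2))
print(v_dup, v_uniq)
```

The helper deliberately raises on a key that really is duplicated, such as ("a", 1) in the first frame, so an ambiguous lookup is not silently reduced to its first row.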
I can confirm this bug still exists for a 2D DataFrame.
Copy-paste example below:
import pandas as pd
df = pd.DataFrame(
index = pd.MultiIndex.from_tuples(
[
('a', 'b'),
('a', 'b'),
('a', 'c'),
('b', 'c'),
('b', 'd')
]
),
columns = ['Tens', 'Hundreds']
)
df['Tens'] = [10,20,30,40, 50]
df['Hundreds'] = [100,200,300,400, 500]
print(df)
print('\n\n')
print('Returning Series Instead of Value')
print(df.loc[('b', 'c'), 'Tens'])
df = pd.DataFrame(
index = pd.MultiIndex.from_tuples(
[
('a', 'b'),
('a', 'd'),
('a', 'c'),
('b', 'c'),
('b', 'd')
]
),
columns = ['Tens', 'Hundreds']
)
df['Tens'] = [10,20,30,40, 50]
df['Hundreds'] = [100,200,300,400, 500]
print()
print(df)
print('\n\n')
print('Returning Value Instead of Series')
print(df.loc[('a', 'c'), 'Tens'])
Output:
     Tens  Hundreds
a b    10       100
  b    20       200
  c    30       300
b c    40       400
  d    50       500

Returning Series Instead of Value
b  c    40
Name: Tens, dtype: int64

     Tens  Hundreds
a b    10       100
  d    20       200
  c    30       300
b c    40       400
  d    50       500

Returning Value Instead of Series
30
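If you would rather always get a Series back, wrapping the key in a list appears to give a consistent result, since a list of tuples is treated as a list-like selection rather than a single label. This is a sketch of a workaround, not a statement about how pandas intends .loc to behave; make_frame is a hypothetical helper used only to rebuild the two frames above.

```python
import pandas as pd


def make_frame(keys):
    # Rebuild the Tens/Hundreds frame from above for a given list of index tuples.
    return pd.DataFrame(
        {"Tens": [10, 20, 30, 40, 50], "Hundreds": [100, 200, 300, 400, 500]},
        index=pd.MultiIndex.from_tuples(keys),
    )


with_dups = make_frame([("a", "b"), ("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")])
no_dups = make_frame([("a", "b"), ("a", "d"), ("a", "c"), ("b", "c"), ("b", "d")])

# Passing [key] instead of key keeps the row axis, so both results are Series.
r1 = with_dups["Tens"].loc[[("b", "c")]]
r2 = no_dups["Tens"].loc[[("b", "c")]]
print(type(r1), type(r2))
```

The trade-off is that the caller then always has to unwrap the value, for example with .iloc[0], even when the index is known to be unique.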
[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Problem description
When indexing a unique index value with .loc on a MultiIndex containing duplicated values, the return type is a Series, while it is a raw value for a single Index.
Expected Output
I think we should expect a raw value to keep consistent with Index behavior.
Output of pd.show_versions()