pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: Dataframe's loc method works incorrectly when selecting a sequence from an index that does not exist. #59076

Closed 987sebasBeller closed 2 days ago

987sebasBeller commented 1 week ago


Reproducible Example

import pandas as pd

# Dataframe data
people=[
     {"name":"Juan","age":15}
    ,{"name":"Jorge","age":25}
    ,{"name":"Sebas","age":30}
    ]

# Dataframe index
index_df=["Person1","Person2","Person3"]

people_df=pd.DataFrame(data=people,index=index_df)

# Example with incorrect result, A non-existent index sequence is selected
print("Real Result:",people_df.loc["Person":],sep="\n")

# Example with correct result
print("Expected Result:",people_df.loc["example":],sep="\n")

Issue Description

The incorrect behavior (The issue)

There is a problem with the loc method when slicing by label. If the start label of the slice does not exist in the index of the data frame, I would expect the result to be an empty data frame, but in this case rows are returned anyway.

# Dataframe data
people=[
     {"name":"Juan","age":15}
    ,{"name":"Jorge","age":25}
    ,{"name":"Sebas","age":30}
    ]

# Dataframe index
index_df=["Person1","Person2","Person3"]

people_df=pd.DataFrame(data=people,index=index_df)

# Example with incorrect result, A non-existent index sequence is selected
print("Real Result:",people_df.loc["Person":],sep="\n")

#The output:
output="""
Real Result:
          name  age
Person1   Juan   15
Person2  Jorge   25
Person3  Sebas   30
"""

Another example:


population_dict={
    'Chuquisaca':626000,
    'La Paz':26,
    'Pando':1500
}

extension_dict={
    'Chuquisaca':626000,
    'La Paz':26
}

population_series=pd.Series(population_dict)

extension_series=pd.Series(extension_dict)

# A dataframe created by two series, The data is about a fake population of bolivia.
bolivia_data=pd.DataFrame({'poblacion':population_series,'extension':extension_series})

# Example with incorrect result, A non-existent index sequence is selected
print(bolivia_data.loc['Cochabamba':'O',])

# The output:
output="""
        poblacion  extension
La Paz         26       26.0
"""

Expected Behavior

The expected behavior

First example:

# Dataframe data
people=[
     {"name":"Juan","age":15}
    ,{"name":"Jorge","age":25}
    ,{"name":"Sebas","age":30}
    ]

# Dataframe index
index_df=["Person1","Person2","Person3"]

people_df=pd.DataFrame(data=people,index=index_df)

# Example with correct result, it prints an empty dataframe
print("Expected Result:",people_df.loc["example":],sep="\n")

#The output:
output="""
Expected Result:
Empty DataFrame
Columns: [name, age]
Index: []
"""

Another example:


population_dict={
    'Chuquisaca':626000,
    'La Paz':26,
    'Pando':1500
}

extension_dict={
    'Chuquisaca':626000,
    'La Paz':26
}

population_series=pd.Series(population_dict)

extension_series=pd.Series(extension_dict)

# A dataframe created by two series, The data is about a fake population of bolivia.
bolivia_data=pd.DataFrame({'poblacion':population_series,'extension':extension_series})

# Example with correct result, it prints an empty dataframe
print(bolivia_data.loc['another_example':,])

# The output:
output="""
Empty DataFrame
Columns: [poblacion, extension]
Index: []
"""

Installed Versions

INSTALLED VERSIONS


commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : AMD64 Family 23 Model 104 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 2.2.2
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.21.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

samukweku commented 1 week ago

I think there is something wrong with slicing: if you pass another value, it returns the same thing. If you select a single value, it fails.

987sebasBeller commented 1 week ago

Thanks for the answer, but the problem is that I am selecting non-existent rows in the dataframe, so I would expect to get an empty dataframe.

chaoyihu commented 1 week ago

Looks like it depends on the lexicographic order of the nonexistent string compared to the existing indices:

>>> people_df.loc["P":]  # "P" < "Person1"
          name  age
Person1   Juan   15
Person2  Jorge   25
Person3  Sebas   30

>>> people_df.loc["Q":]  # "Q" > "Person3"
Empty DataFrame
Columns: [name, age]
Index: []

>>> index_df = ["Person1", "Person9"]
>>> people = [
...     {"name": "Juan", "age": 15},
...     {"name": "Sebas", "age": 30}
... ]
>>> people_df = pd.DataFrame(data=people, index=index_df)
>>> people_df["Person2":]
          name  age
Person9  Sebas   30

But the behavior changes when the indices are not in order:

>>> df=pd.DataFrame(data={"col":[1,2,3]},index=["A", "D", "B"])
>>> df
   col
A    1
D    2
B    3
>>> df.loc["A":]
   col
A    1
D    2
B    3
>>> df.loc["B":]
   col
B    3
>>> df.loc["C":]  # KeyError
>>> df.loc["D":]
   col
D    2
B    3
>>> df.loc["E":]  # KeyError

chaoyihu commented 5 days ago

I looked into it further and found that the behavior is actually consistent with the current documentation, which says:

When using .loc with slices, if both the start and the stop labels are present in the index, then elements located between the two (including them) are returned.

If at least one of the two is absent, but the index is sorted, and can be compared against start and stop labels, then slicing will still work as expected, by selecting labels which rank between the two.

However, if at least one of the two is absent and the index is not sorted, an error will be raised.

But it's true that the documentation does not mention that an empty frame can be returned, which can be confusing in some cases.
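
The three documented cases, plus the empty result, can be reproduced side by side. The following is a small sketch using two made-up frames (`sorted_df` and `unsorted_df` are illustrative names, not from the report) that differ only in the order of their labels.

```python
import pandas as pd

sorted_df = pd.DataFrame({"col": [1, 2, 3]}, index=["A", "B", "D"])
unsorted_df = pd.DataFrame({"col": [1, 2, 3]}, index=["A", "D", "B"])

# Case 1: both labels are present, so everything between them
# (inclusive) is returned.
both_present = sorted_df.loc["A":"B"]

# Case 2: the start label is absent but the index is sorted, so the
# labels that rank after it are returned.
absent_sorted = sorted_df.loc["C":]    # only "D" ranks after "C"
absent_past_end = sorted_df.loc["E":]  # nothing ranks after "E"

# Case 3: the start label is absent and the index is not sorted,
# so the lookup raises KeyError.
try:
    unsorted_df.loc["C":]
    raised = False
except KeyError:
    raised = True

print(both_present, absent_sorted, absent_past_end, raised, sep="\n")
```

The `absent_past_end` case is the one the documentation leaves implicit: the slice is valid, but no label ranks after "E", so the frame is empty.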

chaoyihu commented 2 days ago

@987sebasBeller, it seems that no fix is needed since the documentation correctly reflects the behavior of df.loc. I think this issue can probably be closed, though further discussion is definitely welcome in case anyone has more questions or suggested fixes.
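
For anyone who does want the stricter behavior described in the original report, one option is to check for the label before slicing. This is only a sketch: `slice_from_existing` is a hypothetical helper written for this thread, not a pandas API, and it assumes that an empty frame with the same columns is the desired result when the start label is missing.

```python
import pandas as pd


def slice_from_existing(df: pd.DataFrame, start) -> pd.DataFrame:
    """Hypothetical helper: like df.loc[start:], but returns an empty
    frame when `start` is not an actual label in the index."""
    if start not in df.index:
        # iloc[0:0] keeps the columns and dtypes but selects no rows.
        return df.iloc[0:0]
    return df.loc[start:]


people_df = pd.DataFrame(
    {"name": ["Juan", "Jorge", "Sebas"], "age": [15, 25, 30]},
    index=["Person1", "Person2", "Person3"],
)

missing = slice_from_existing(people_df, "Person")   # label does not exist
present = slice_from_existing(people_df, "Person2")  # label exists

print(missing)
print(present)
```

The membership test makes the intent explicit at the call site, at the cost of giving up the rank-based slicing that sorted indexes otherwise allow.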

987sebasBeller commented 2 days ago

@chaoyihu Yes, thank you.