Closed 987sebasBeller closed 2 days ago
I think there is something wrong with slicing: if you pass another value, it returns the same thing. If you select a single value, it fails.
Thanks for the answer, but the problem is that I am selecting non-existent rows in the DataFrame, so I would expect to get an empty DataFrame.
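If the goal is exact label matching, so that labels absent from the index simply yield no rows, one possible workaround is a boolean mask built with `Index.isin` instead of a label slice. This is only a sketch: the frame mirrors the one printed in this thread, and the `wanted` list is a hypothetical input, not something from the original report.

```python
import pandas as pd

people_df = pd.DataFrame(
    data=[
        {"name": "Juan", "age": 15},
        {"name": "Jorge", "age": 25},
        {"name": "Sebas", "age": 30},
    ],
    index=["Person1", "Person2", "Person3"],
)

# Hypothetical labels, none of which exist in the index.
wanted = ["P", "Q"]

# Index.isin returns a boolean array; a row is kept only on an exact label match.
exact = people_df.loc[people_df.index.isin(wanted)]

# A mix of present and absent labels keeps just the present ones.
partial = people_df.loc[people_df.index.isin(["Person2", "Nobody"])]
```

Unlike a slice, the mask never compares labels by order, so the result does not depend on whether the index is sorted.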
Looks like it depends on the lexicographic order of the nonexistent string compared to the existing indices:
>>> people_df.loc["P":] # "P" < "Person1"
name age
Person1 Juan 15
Person2 Jorge 25
Person3 Sebas 30
>>> people_df.loc["Q":] # "Q" > "Person3"
Empty DataFrame
Columns: [name, age]
Index: []
>>> index_df = ["Person1", "Person9"]
>>> people = [
... {"name": "Juan", "age": 15},
... {"name": "Sebas", "age": 30}
... ]
>>> people_df = pd.DataFrame(data=people, index=index_df)
>>> people_df["Person2":]
name age
Person9 Sebas 30
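The string comparisons noted in the comments above can be checked in plain Python, without pandas. The sketch below only reuses labels that appear in this thread; the variable names are illustrative.

```python
labels = ["Person1", "Person2", "Person3"]

# Strings compare character by character; a proper prefix sorts before the longer string.
p_sorts_first = all("P" < label for label in labels)

# "Q" follows "P" in code-point order, so it sorts after every "Person..." label.
q_sorts_last = all("Q" > label for label in labels)

# "Person2" is absent from ["Person1", "Person9"] but ranks between the two.
person2_between = "Person1" < "Person2" < "Person9"
```

All three flags evaluate to `True`, which matches the slices shown: `"P":` starts before every row, `"Q":` starts after every row, and `"Person2":` starts between the two existing labels.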
But the behavior changes when the indices are not in order:
>>> df=pd.DataFrame(data={"col":[1,2,3]},index=["A", "D", "B"])
>>> df
col
A 1
D 2
B 3
>>> df.loc["A":]
col
A 1
D 2
B 3
>>> df.loc["B":]
col
B 3
>>> df.loc["C":] # KeyError
>>> df.loc["D":]
col
D 2
B 3
>>> df.loc["E":] # KeyError
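One way to avoid the `KeyError` for absent labels, assuming the original row order does not need to be preserved, is to sort the index before slicing. A minimal sketch using the same frame as above:

```python
import pandas as pd

df = pd.DataFrame(data={"col": [1, 2, 3]}, index=["A", "D", "B"])

# The index A, D, B is not monotonic, which is why absent labels raise KeyError.
is_sorted = df.index.is_monotonic_increasing

# sort_index returns a new frame whose index is A, B, D.
sorted_df = df.sort_index()

# Absent labels are now resolved by their rank among the sorted labels.
from_c = sorted_df.loc["C":]  # only "D" ranks at or after "C"
from_e = sorted_df.loc["E":]  # nothing ranks at or after "E"
```

Note that sorting changes what a slice means: on the sorted frame, `"D":` selects only row D, whereas on the original frame it selected D and B.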
I looked into it further and found that the behavior is actually consistent with the current documentation.
When using .loc with slices, if both the start and the stop labels are present in the index, then elements located between the two (including them) are returned.
If at least one of the two is absent, but the index is sorted, and can be compared against start and stop labels, then slicing will still work as expected, by selecting labels which rank between the two.
However, if at least one of the two is absent and the index is not sorted, an error will be raised.
But it's true that the documentation does not mention that an empty frame can be returned, which can be confusing in some cases.
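The three cases described above can be sketched with explicit start and stop labels. The frames and variable names here are illustrative, chosen to match the earlier examples in this thread.

```python
import pandas as pd

unsorted_df = pd.DataFrame({"col": [1, 2, 3]}, index=["A", "D", "B"])
sorted_df = pd.DataFrame({"col": [1, 3, 2]}, index=["A", "B", "D"])

# Case 1: both labels are present, so rows located between them
# (endpoints included) are returned, even though the index is unsorted.
both_present = unsorted_df.loc["A":"D"]

# Case 2: labels are absent but the index is sorted, so rows whose
# labels rank between start and stop are returned.
absent_sorted = sorted_df.loc["C":"E"]

# Case 2, empty variant: no label ranks between "E" and "F".
nothing_in_range = sorted_df.loc["E":"F"]

# Case 3: a label is absent and the index is not sorted, so pandas raises KeyError.
try:
    unsorted_df.loc["C":"E"]
    raised_key_error = False
except KeyError:
    raised_key_error = True
```

The empty variant of case 2 is the situation the documentation does not spell out: the slice is valid, but no label falls inside it.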
@987sebasBeller, it seems that no fix is needed, since the documentation correctly reflects the behavior of df.loc. I think this issue can probably be closed, though further discussion is welcome if anyone has more questions or suggested fixes.
@chaoyihu Yes, thank you.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The incorrect behavior (The issue)
There is a problem with the loc method: when you try to select a sequence of elements by index label and that sequence does not match the index of the DataFrame, the result should be an empty DataFrame, but it is not.
Another example:
Expected Behavior
The expected behavior
First example:
Another example:
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.11.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : AMD64 Family 23 Model 104 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 2.2.2
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.21.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None