pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: freq after sort values differs between arm Mac and ubuntu #59092

Open phofl opened 1 week ago

phofl commented 1 week ago

Pandas version checks

Reproducible Example

pdf = pd.DataFrame({"z": 1})
pdf.z.sort_values()

2020-12-31    1
2021-01-01    1
2021-01-02    1
2021-01-03    1
2021-01-04    1
             ..
2021-04-05    1
2021-04-06    1
2021-04-07    1
2021-04-08    1
2021-04-09    1
Freq: D, Name: z, Length: 100, dtype: int64

Issue Description

sort_values preserves the index freq on Ubuntu but loses it on an ARM Mac.

Expected Behavior

I don't know which result is right; consistency matters above everything else.

I can't test the nightlies on ubuntu right now, that's why it's only 2.2.2

Installed Versions

INSTALLED VERSIONS
------------------
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python : 3.12.4.final.0
python-bits : 64
OS : Darwin
OS-release : 23.3.0
Version : Darwin Kernel Version 23.3.0: Wed Dec 20 21:33:31 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T8112
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 2.2.2
numpy : 2.0.0
pytz : 2024.1
dateutil : 2.9.0
setuptools : 70.1.0
pip : 24.0
Cython : None
pytest : 8.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.25.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.6.0
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : 2024.6.0
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

phofl commented 1 week ago

cc @jbrockmendel thoughts here?

jbrockmendel commented 1 week ago

I don't have an Ubuntu env ATM, so I'm speculating: sort_values goes through nargsort, which I suspect may have undefined behavior in non-unique cases.

BTW the example in the OP is not reproducible, needs an index passed to DataFrame.
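
A minimal sketch of the tie-ordering concern raised above, using NumPy directly rather than pandas internals. It relies only on NumPy's documented behavior: the default argsort kind ("quicksort") does not promise any order among equal elements, while kind="stable" does. Whether this is what actually differs between the two platforms in this issue is an assumption, not something confirmed in the thread.

```python
import numpy as np

# Every element ties with every other, as in the column "z" of the report.
values = np.ones(100, dtype=np.int64)

# Default kind ("quicksort") gives no guarantee about the order of ties,
# so the result may or may not be the identity permutation on a given build.
default_order = np.argsort(values)

# A stable sort keeps ties in their original order, so for all-equal input
# it must return 0, 1, ..., 99.
stable_order = np.argsort(values, kind="stable")

identity = np.arange(len(values))
print("default is identity:", np.array_equal(default_order, identity))
print("stable is identity: ", np.array_equal(stable_order, identity))
```

If the first line prints False on one machine and True on another, the sorted index would come back in a different row order on each, which would be consistent with the freq being kept in one case and dropped in the other.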

Siddharth-Latthe-07 commented 2 days ago

@phofl pandas has historically had issues with preserving the freq attribute of DatetimeIndex after certain operations. The sort_values method should ideally not affect the freq attribute, but inconsistencies can arise due to differences in underlying implementations or versions. You can try this code:

import pandas as pd

# Create a DataFrame with a single column
pdf = pd.DataFrame({"z": 1}, index=pd.date_range(start='2020-12-31', periods=100, freq='D'))

# Sort the values in the column
sorted_values = pdf.z.sort_values()

# Restore the frequency attribute
sorted_values.index.freq = pdf.index.freq

# Print the sorted values
print(sorted_values)

Conclusion: the behavior of pandas should ideally be consistent across platforms, but occasional discrepancies occur. Explicitly setting the freq attribute after sorting restores it, provided the sorted index still conforms to that frequency; if the rows were reordered, the assignment raises a ValueError.
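
A hedged variant of the workaround above, for the case where the tied rows might come back in a different order. It assumes the same index as the earlier snippet, requests a stable sort so that ties keep their original order, and only assigns a freq when pd.infer_freq finds one. This is a sketch of one possible approach, not a statement of how pandas intends sort_values to behave.

```python
import pandas as pd

# Same setup as the workaround above: 100 daily timestamps, constant column.
idx = pd.date_range(start="2020-12-31", periods=100, freq="D")
pdf = pd.DataFrame({"z": 1}, index=idx)

# kind="stable" keeps tied rows in their original order, so the index order
# does not depend on how an unstable sort happens to break ties.
sorted_values = pdf.z.sort_values(kind="stable")

# Restore freq only if it was dropped and the index still has a regular
# frequency. pd.infer_freq returns None when no frequency can be inferred,
# and assigning a non-conforming freq would raise ValueError.
if sorted_values.index.freq is None:
    inferred = pd.infer_freq(sorted_values.index)
    if inferred is not None:
        sorted_values.index.freq = inferred

print(sorted_values.index.freq)
```

The guard matters because a direct assignment such as sorted_values.index.freq = pdf.index.freq is validated against the index values, so it fails rather than silently attaching a wrong frequency.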