pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.69k stars 17.91k forks source link

BUG: rolling with method='table' and apply sorts columns alphabetically #59666

Open valeju opened 2 months ago

valeju commented 2 months ago

Pandas version checks

Reproducible Example

import pandas as pd
import numpy as np

datetime_column = "datetime"

def sum_of_subtraction(values):
    col_1 = values[:, 0]
    col_2 = values[:, 1]
    return np.mean(col_1 - col_2)

cols_order_1 = ["a", "b"]
cols_order_2 = ["b", "a"]

df = pd.DataFrame(
    {
        datetime_column: pd.date_range("2020-01-01", periods=6),
        "a": [1, 2, 3, 4, 5, 6],
        "b": [6, 7, 8, 5, 6, 7],
    }
)
order_1_df = (
    df
    .rolling('3D', on=datetime_column, method='table')[cols_order_1]
    .apply(sum_of_subtraction, engine="numba", raw=True)
)

order_2_df = (
    df
    .rolling('3D', on=datetime_column, method='table')[cols_order_2]
    .apply(sum_of_subtraction, engine="numba", raw=True)
)

Issue Description

With the code above, the function sum_of_subtraction returns a different results if the columns of the input argument values are swapped (cf the example below).

x = np.array([[1, 2],[3, 4]])
y = x[:, [1, 0]]  # columns are swapped
print(sum_of_subtraction(x))  # returns -1.0
print(sum_of_subtraction(y))  # returns 1.0

Yet, using rolling with method='table' followed with apply returns same results, no matter or the columns ordering. Even though the columns were given with a specific ordering, when calling apply with engine='numba', the columns are sorted alphabetically. With the code above, here are the results:

order_1_df: a b datetime
0 -5 -5 2020-01-01 00:00:00
1 -5 -5 2020-01-02 00:00:00
2 -5 -5 2020-01-03 00:00:00
3 -3.66667 -3.66667 2020-01-04 00:00:00
4 -2.33333 -2.33333 2020-01-05 00:00:00
5 -1 -1 2020-01-06 00:00:00
order_2_df: a b datetime
0 -5 -5 2020-01-01 00:00:00
1 -5 -5 2020-01-02 00:00:00
2 -5 -5 2020-01-03 00:00:00
3 -3.66667 -3.66667 2020-01-04 00:00:00
4 -2.33333 -2.33333 2020-01-05 00:00:00
5 -1 -1 2020-01-06 00:00:00

Expected Behavior

The results when the columns are ordering as such ["b", "a"] must be like that (order_2_df dataframe):

b a datetime
0 5 5 2020-01-01 00:00:00
1 5 5 2020-01-02 00:00:00
2 5 5 2020-01-03 00:00:00
3 3.66667 3.66667 2020-01-04 00:00:00
4 2.33333 2.33333 2020-01-05 00:00:00
5 1 1 2020-01-06 00:00:00

Installed Versions

INSTALLED VERSIONS ------------------ commit : e4956ab403846387a435cd7b3a8f36828c23c0c7 python : 3.10.14 python-bits : 64 OS : Linux OS-release : 5.15.0-1066-azure Version : # 75-Ubuntu SMP Thu May 30 14:29:45 UTC 2024 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : en_US.UTF-8 LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 3.0.0.dev0+1413.ge4956ab403 numpy : 2.0.2 dateutil : 2.9.0 pip : 24.2 Cython : None sphinx : None IPython : 8.26.0 adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None blosc : None bottleneck : None fastparquet : None fsspec : None html5lib : None hypothesis : None gcsfs : None jinja2 : None lxml.etree : None matplotlib : None numba : 0.60.0 numexpr : None odfpy : None openpyxl : None psycopg2 : None pymysql : None pyarrow : None pyreadstat : None pytest : None python-calamine : None pytz : 2024.1 pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlsxwriter : None zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None
rhshadrach commented 1 month ago

Thanks for the report!, further investigations and PRs to fix are welcome!