pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

upsample not working if arguments are only sorted within group #18229

Closed thomascamminady closed 3 weeks ago

thomascamminady commented 3 weeks ago

Reproducible example

from datetime import datetime

import polars as pl

df = pl.DataFrame(
    {
        "time": [
            datetime(2021, 2, 1),
            datetime(2021, 1, 1),  # this is 2021-04-01 in the docs, i.e. sorted
            datetime(2021, 5, 1),
            datetime(2021, 6, 1),
        ],
        "groups": ["A", "B", "A", "B"],
        "values": [0, 1, 2, 3],
    }
)

df_upsampled = df.upsample(
    time_column="time", every="1mo", group_by="groups", maintain_order=True
)

Log output

---------------------------------------------------------------------------
InvalidOperationError                     Traceback (most recent call last)
/var/folders/1c/6_s1_dhd2xngnxyrz3vnpqfr0000gq/T/ipykernel_51814/920196386.py in ?()
     15     }
     16 )#.sort("time")
     17 
     18 
---> 19 df_upsampled = df.upsample(
     20     time_column="time", every="1mo", group_by="groups", maintain_order=True
     21 )
     22 

~/Dev/performance_management_chart/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py in ?(*args, **kwargs)
     87         def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     88             _rename_keyword_argument(
     89                 old_name, new_name, kwargs, function.__qualname__, version
     90             )
---> 91             return function(*args, **kwargs)

~/Dev/performance_management_chart/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py in ?(self, time_column, every, group_by, maintain_order)
   6419 
   6420         every = parse_as_duration_string(every)
   6421 
   6422         return self._from_pydf(
-> 6423             self._df.upsample(group_by, time_column, every, maintain_order)
   6424         )

InvalidOperationError: argument in operation 'upsample' is not sorted, please sort the 'expr/series/column' first

Issue description

I'm not sure whether this is a bug or intended behavior, but I found it unintuitive. I would like to upsample my data while grouping by another column (groups). My assumption was that if I call sort('groups', 'time') and then upsample(...., group_by='groups'), the input counts as sorted because it is sorted within each group.

Quoting from the upsample doc:

Result will be sorted by time_column (but note that if group_by columns are passed, it will only be sorted within each group).

So I would assume the same applies to the input. In the MWE, the time column isn't sorted globally, but it is sorted within each group.

Expected behavior

I would think that something like this should always work:

df = pl.DataFrame(.....).sort("groups","time")
df.upsample(time_column="time", every="1mo", group_by="groups")

Installed versions

```
--------Version info---------
Polars: 1.4.1
Index type: UInt32
Platform: macOS-14.5-arm64-arm-64bit
Python: 3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:
nest_asyncio: 1.6.0
numpy: 2.0.1
openpyxl:
pandas: 2.2.2
pyarrow: 17.0.0
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

MarcoGorelli commented 3 weeks ago

thanks @thomascamminady for the report, looks like a bug

thomascamminady commented 3 weeks ago

Closed as per https://github.com/pola-rs/polars/pull/18264