pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Upsampling unexpectedly discards higher frequency data #14327

Open jwhitaker-gridcog opened 7 months ago

jwhitaker-gridcog commented 7 months ago

Reproducible example

from datetime import datetime as dt, timedelta
import io
import polars as pl
import polars.testing

def test_upsample() -> None:
    df = pl.from_records(
        [
            (dt(2021, 1, 1) + t, val)
            for (t, val) in [
                (timedelta(minutes=0), 1),
                (timedelta(minutes=1), 1),
            ]
        ],
        schema={"t": pl.Datetime("ms", "UTC"), "val": pl.Int32()},
    ).sort("t")

    upsampled = df.upsample("t", every="120s")

    print(upsampled)
    # shape: (1, 2)
    # ┌─────────────────────────┬─────┐
    # │ t                       ┆ val │
    # │ ---                     ┆ --- │
    # │ datetime[ms, UTC]       ┆ i32 │
    # ╞═════════════════════════╪═════╡
    # │ 2021-01-01 00:00:00 UTC ┆ 1   │
    # └─────────────────────────┴─────┘

    print(df)
    # shape: (2, 2)
    # ┌─────────────────────────┬─────┐
    # │ t                       ┆ val │
    # │ ---                     ┆ --- │
    # │ datetime[ms, UTC]       ┆ i32 │
    # ╞═════════════════════════╪═════╡
    # │ 2021-01-01 00:00:00 UTC ┆ 1   │
    # │ 2021-01-01 00:01:00 UTC ┆ 1   │
    # └─────────────────────────┴─────┘

    pl.testing.assert_frame_equal(upsampled, df)

if __name__ == "__main__":
    test_upsample()

Log output

> env POLARS_VERBOSE=1 poetry run python demo.py
shape: (1, 2)
┌─────────────────────────┬─────┐
│ t                       ┆ val │
│ ---                     ┆ --- │
│ datetime[ms, UTC]       ┆ i32 │
╞═════════════════════════╪═════╡
│ 2021-01-01 00:00:00 UTC ┆ 1   │
└─────────────────────────┴─────┘
shape: (2, 2)
┌─────────────────────────┬─────┐
│ t                       ┆ val │
│ ---                     ┆ --- │
│ datetime[ms, UTC]       ┆ i32 │
╞═════════════════════════╪═════╡
│ 2021-01-01 00:00:00 UTC ┆ 1   │
│ 2021-01-01 00:01:00 UTC ┆ 1   │
└─────────────────────────┴─────┘
Traceback (most recent call last):
  File "/home/jarrad/src/sandpit/pl-upsample/demo.py", line 46, in <module>
    test_upsample()
  File "/home/jarrad/src/sandpit/pl-upsample/demo.py", line 42, in test_upsample
    pl.testing.assert_frame_equal(upsampled, df)
  File "/home/jarrad/src/sandpit/pl-upsample/.venv/lib/python3.10/site-packages/polars/testing/asserts/frame.py", line 104, in assert_frame_equal
    raise_assertion_error(
  File "/home/jarrad/src/sandpit/pl-upsample/.venv/lib/python3.10/site-packages/polars/testing/asserts/utils.py", line 12, in raise_assertion_error
    raise AssertionError(msg) from cause
AssertionError: DataFrames are different (number of rows does not match)
[left]:  1
[right]: 2

Issue description

When running upsample, existing data at a higher frequency than the upsample target seems to be discarded. E.g. in the example, my dataframe has two rows at 00:00 and 00:01, and I upsample to 2m (same result with 1h, 61s, ...).

After upsampling, the second row has been discarded.

Expected behavior

Maybe naïvely, I expected existing higher-frequency data to be preserved.

If this behaviour from polars is intentional, then I'd at least have expected to see something in the upsample docs like

Data in df at a higher frequency than every will be discarded.

or at a minimum

Upsample's behaviour on data at a higher frequency than every is undefined and subject to change in future versions of Polars.

Thanks for your work maintaining Polars!

Why are you upsampling to a lower frequency than your data, are you stupid?

For context, my situation here is that I have data in a mix of daily and hourly frequencies, with missing records in both parts. I was attempting to upsample to daily first, because I wanted to treat missing daily records differently from missing hourly records. However, this ran into this bug/unexpected behaviour, because upsampling to daily threw away some of the hourly records.

Installed versions

```
--------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             Linux-6.6.10-76060610-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:
numpy:
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
deanm0000 commented 7 months ago

I don't know for sure but I suspect it is intended behavior. This way, after running upsample you can rely on the spacing from row to row being consistent. I think you need to do something like this:

df.join(
    df.upsample("t", every="120s"),
    on="t",
    how="outer_coalesce",
)
cmdlineluser commented 7 months ago

Not sure on the intended behaviour, but this is what is happening:

upsample generates the range and performs a left join.

https://github.com/pola-rs/polars/blob/6ab6550ad476227cca550cd24a692d1a00a8c937/crates/polars-time/src/upsample.rs#L177-L193

The range in this case just returns the start value (as the interval exceeds the range)

df.select(
    range=pl.date_range(pl.col("t").min(), pl.col("t").max(), interval="120s")
)
# shape: (1, 1)
# ┌─────────────────────────┐
# │ range                   │
# │ ---                     │
# │ datetime[ms, UTC]       │
# ╞═════════════════════════╡
# │ 2021-01-01 00:00:00 UTC │
# └─────────────────────────┘
deanm0000 commented 7 months ago

I guess there are two schools of thought.

  1. Add a parameter to upsample to do an outer join instead of left
  2. Shrug and suggest people do the full join themselves.

Doing #2 is certainly better than what I said above, which does 2 joins when you only need 1.

vlcinsky commented 2 months ago

I have the same issue.

One correction: it is not only existing higher-frequency data that is discarded; in fact it discards any data that does not match the target sampling frequency.

It is surprising, and I would invite:

  • either the documentation

    • explicitly explaining this behaviour
    • and providing an example of how to succeed
  • or the upsample method behaving more as many would expect

My attempt to rewrite my code from pandas to polars failed on this, and it shows that in this regard (resampling) pandas provides the better service. (I am aware pandas has been with us far longer, and I really love polars.)

WolfEYc commented 1 month ago

> I have the same issue.
>
> One correction: it is not only existing higher-frequency data that is discarded; in fact it discards any data that does not match the target sampling frequency.
>
> It is surprising, and I would invite:
>
>   • either the documentation
>
>     • explicitly explaining this behaviour
>     • and providing an example of how to succeed
>   • or the upsample method behaving more as many would expect
>
> My attempt to rewrite my code from pandas to polars failed on this, and it shows that in this regard (resampling) pandas provides the better service. (I am aware pandas has been with us far longer, and I really love polars.)

Yeah, in my case I have low-frequency data, but because the points do not perfectly align, it straight up discards data. Very misleading function!