pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`df.upsample` hangs when an "object" type column is present #18445

Open blaylockbk opened 2 months ago

blaylockbk commented 2 months ago

Reproducible example

Given a dataframe with an "object" type column

from datetime import datetime
from pathlib import Path
import polars as pl

df = pl.DataFrame(
    {
        "filepath": [
            Path("/this/path/file1"),
            Path("/this/path/file2"),
            Path("/this/path/file3"),
            Path("/this/path/file4"),
        ],
        "creation_date": [
            datetime(2024, 1, 1),
            datetime(2024, 1, 2),
            datetime(2024, 1, 5),
            datetime(2024, 1, 6),
        ],
        "file_size": [1, 2, 3, 4],
        "file_name": ["file1", "file2", "file3", "file4"],
    }
)
┌──────────────────┬─────────────────────┬───────────┬───────────┐
│ filepath         ┆ creation_date       ┆ file_size ┆ file_name │
│ ---              ┆ ---                 ┆ ---       ┆ ---       │
│ object           ┆ datetime[μs]        ┆ i64       ┆ str       │
╞══════════════════╪═════════════════════╪═══════════╪═══════════╡
│ /this/path/file1 ┆ 2024-01-01 00:00:00 ┆ 1         ┆ file1     │
│ /this/path/file2 ┆ 2024-01-02 00:00:00 ┆ 2         ┆ file2     │
│ /this/path/file3 ┆ 2024-01-05 00:00:00 ┆ 3         ┆ file3     │
│ /this/path/file4 ┆ 2024-01-06 00:00:00 ┆ 4         ┆ file4     │
└──────────────────┴─────────────────────┴───────────┴───────────┘

"upsampling" appears to hang...

df.upsample("creation_date", every="1d")  # <-- never finishes

Log output

none

Issue description

I have a dataframe with an "object" type column filled with pathlib.Path objects. When I try to "upsample" the dataframe by the "creation_date" column, it hangs.

However, if I remove the "object" column, upsample works fine.

df.select(pl.exclude(pl.Object)).upsample("creation_date", every="1d")
shape: (6, 3)
┌─────────────────────┬───────────┬───────────┐
│ creation_date       ┆ file_size ┆ file_name │
│ ---                 ┆ ---       ┆ ---       │
│ datetime[μs]        ┆ i64       ┆ str       │
╞═════════════════════╪═══════════╪═══════════╡
│ 2024-01-01 00:00:00 ┆ 1         ┆ file1     │
│ 2024-01-02 00:00:00 ┆ 2         ┆ file2     │
│ 2024-01-03 00:00:00 ┆ null      ┆ null      │
│ 2024-01-04 00:00:00 ┆ null      ┆ null      │
│ 2024-01-05 00:00:00 ┆ 3         ┆ file3     │
│ 2024-01-06 00:00:00 ┆ 4         ┆ file4     │
└─────────────────────┴───────────┴───────────┘

Expected behavior

I expected `upsample` to work even when an "object" type column is present, and to fill the inserted rows of the "object" column with null, as it does for the other columns.
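
To make the expected result concrete, here is a plain-Python sketch of the behavior I have in mind. It is not how Polars implements `upsample`; the helper `upsample_daily` is hypothetical and only illustrates that the inserted rows would hold `None` in the "object" column.

```python
from datetime import datetime, timedelta
from pathlib import Path


def upsample_daily(rows, key):
    """Return rows for every day between the first and last `key` value.

    Days missing from the input get a row where every column except `key`
    is None. Hypothetical helper, for illustration only.
    """
    by_key = {row[key]: row for row in rows}
    columns = list(rows[0].keys())
    current, end = min(by_key), max(by_key)
    out = []
    while current <= end:
        filler = {col: None for col in columns}
        filler[key] = current
        out.append(by_key.get(current, filler))
        current += timedelta(days=1)
    return out


rows = [
    {"filepath": Path("/this/path/file1"), "creation_date": datetime(2024, 1, 1)},
    {"filepath": Path("/this/path/file2"), "creation_date": datetime(2024, 1, 2)},
    {"filepath": Path("/this/path/file3"), "creation_date": datetime(2024, 1, 5)},
    {"filepath": Path("/this/path/file4"), "creation_date": datetime(2024, 1, 6)},
]

expected = upsample_daily(rows, "creation_date")
for row in expected:
    print(row["creation_date"].date(), row["filepath"])
```

The rows for 2024-01-03 and 2024-01-04 carry `None` for `filepath`, which is what I would expect to see as null in the upsampled dataframe.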

Installed versions

```
--------Version info---------
Polars:              1.6.0
Index type:          UInt32
Platform:            Linux-5.14.21-150400.24.119-default-x86_64-with-glibc2.31
Python:              3.12.5 | packaged by conda-forge | (main, Aug 8 2024, 18:36:51) [GCC 12.4.0]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle          3.0.0
connectorx
deltalake
fastexcel
fsspec               2024.6.1
gevent
great_tables
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                2.0.1
openpyxl
pandas               2.2.2
pyarrow              17.0.0
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

MarcoGorelli commented 2 months ago

thanks @blaylockbk for the report, will take a look