pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Filter duration by string intervals #5370

Open braaannigan opened 1 year ago

braaannigan commented 1 year ago

Problem description

The string intervals are great. It would be nice to use them to filter pl.Duration columns:

from datetime import datetime

import polars as pl

start = datetime(2022, 1, 1)
stop = datetime(2022, 1, 2)
df = pl.DataFrame(
    {
        'date':pl.date_range(
            low = start,
            high = stop,
            interval='1h'
        ),
    }
)

(
     df
    .filter(
        pl.col("date").diff() < "2h"
    )
)

mcrumiller commented 1 year ago

You can use pl.duration, although I notice a bug: comparing using a pl.Duration with a pl.duration fails when the pl.Duration doesn't have units of ms.

# this creates an error
df.select(pl.col("date").diff() < pl.duration(hours=2))

# this does not
df.select(pl.col("date").diff().cast(pl.Duration("ms")) < pl.duration(hours=2))

mcrumiller commented 1 year ago

OK, my statement about the bug was wrong, since all of these work:

from datetime import timedelta

pl.select(pl.lit(timedelta(hours=1), dtype=pl.Duration("ms")) - pl.duration(hours=1))
pl.select(pl.lit(timedelta(hours=1), dtype=pl.Duration("us")) - pl.duration(hours=1))
pl.select(pl.lit(timedelta(hours=1), dtype=pl.Duration("ns")) - pl.duration(hours=1))

braaannigan commented 1 year ago

Yes, I just think Polars already has a parser for these strings, and they are really snappy, so why not?
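
To make the request concrete, here is a minimal stdlib-only sketch of the conversion such a comparison would need: turning a string like "2h" into a duration that can be compared against time differences. This is not Polars' own parser. The names parse_interval, _UNITS and _TOKEN are hypothetical, and the unit table is an assumed subset of the interval suffixes (calendar units such as "mo" and "y" are left out because they have no fixed length).

```python
import re
from datetime import timedelta

# Hypothetical unit table (an assumption, not taken from Polars):
# only fixed-length units are supported.
_UNITS = {
    "us": timedelta(microseconds=1),
    "ms": timedelta(milliseconds=1),
    "s": timedelta(seconds=1),
    "m": timedelta(minutes=1),
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
}

# Two-letter units come first so "ms" is not read as "m" followed by "s".
_TOKEN = re.compile(r"(\d+)(us|ms|s|m|h|d|w)")


def parse_interval(text: str) -> timedelta:
    """Parse a string such as "2h" or "1d12h" into a timedelta."""
    pos = 0
    total = timedelta()
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse interval {text!r} at position {pos}")
        total += int(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty interval string")
    return total


# Filter plain Python durations with a string threshold.
gaps = [timedelta(hours=1), timedelta(hours=3), timedelta(minutes=90)]
short_gaps = [gap for gap in gaps if gap < parse_interval("2h")]
```

The resulting timedelta could then be used as the right-hand side of a duration comparison in the same way as pl.duration(hours=2) above, assuming the expression accepts a Python timedelta as a literal.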