Open MarcoGorelli opened 1 month ago
@MarcoGorelli I'll add this alternative method, which is faster than the current solution, in case you choose to continue with this issue in the future:
```python
import pandas as pd
import polars as pl

num_rows = 1_000_000
utc_time = pd.date_range(start='2023-01-01', periods=num_rows, freq='s')
df = pd.DataFrame({'UTC_Time': utc_time})
df['UTC_Time'] = df['UTC_Time'].sort_values()
print(df.head())

df = pl.DataFrame(df)
# Method 1: built-in truncate
df = df.with_columns(pl.col("UTC_Time").dt.truncate("2m").alias("Method1"))
# Method 2: floor-divide epoch nanoseconds by the bucket size, then scale back
df = df.with_columns(
    pl.from_epoch(
        (pl.col("UTC_Time").dt.epoch(time_unit="ns") // (2 * 60 * 1_000_000_000))
        * (2 * 60 * 1_000_000_000),
        time_unit="ns",
    ).alias("Method2")
)

%timeit df.with_columns(pl.col("UTC_Time").dt.truncate("2m"))
%timeit df.with_columns(pl.from_epoch((pl.col("UTC_Time").dt.epoch(time_unit="ns") // (2 * 60 * 1_000_000_000)) * (2 * 60 * 1_000_000_000), time_unit="ns"))
```
Console output (`truncate` first, epoch arithmetic second):
```
4.62 ms ± 989 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.63 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
This was done on the latest version, 1.0.0 alpha 1.
thanks - I'm seeing a much smaller difference though:
```
3.16 ms ± 77.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.61 ms ± 186 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
I'm starting a tracker based on https://github.com/pola-rs/polars/issues/16531
The things that need doing are:
- The slowpath for `truncate` can probably be optimised by doing the following: