pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

dt.epoch() is much slower than truediv() for the same operations #16716

Open Chuck321123 opened 5 months ago

Chuck321123 commented 5 months ago

Reproducible example

```python
import pandas as pd
import polars as pl

num_rows = 1000000
# pandas date_range yields nanosecond-resolution timestamps
utc_time = pd.date_range(start='2023-01-01', periods=num_rows, freq='ms')

# Create the DataFrame
df = pd.DataFrame({
    'UTC_Time': utc_time
})

df['UTC_Time'] = df['UTC_Time'].sort_values()

df = pl.DataFrame(df)

df = df.with_columns(pl.col("UTC_Time").truediv(1).alias("Unix1"))
df = df.with_columns(pl.col("UTC_Time").dt.epoch(time_unit="ns").alias("Unix2"))

# Display the first few rows
print(df.head())

# The %timeit lines below are IPython magics; run them in IPython or Jupyter.
# Each pair compares truediv() with the matching dt.epoch() time unit.
%timeit df.with_columns(pl.col("UTC_Time").truediv(1))
%timeit df.with_columns(pl.col("UTC_Time").dt.epoch(time_unit="ns")) # Slightly faster

%timeit df.with_columns(pl.col("UTC_Time").truediv(1000))
%timeit df.with_columns(pl.col("UTC_Time").dt.epoch(time_unit="us")) # Much slower

%timeit df.with_columns(pl.col("UTC_Time").truediv(1000000))
%timeit df.with_columns(pl.col("UTC_Time").dt.epoch(time_unit="ms")) # Much slower

%timeit df.with_columns(pl.col("UTC_Time").truediv(1000000000))
%timeit df.with_columns(pl.col("UTC_Time").dt.epoch(time_unit="s")) # Even slower
```

Log output

16.8 µs ± 576 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
13.2 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

1.65 ms ± 26.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
2.74 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

1.65 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
2.76 ms ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

1.67 ms ± 35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
4.22 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Issue description

I'm unsure whether this is a bug, but I find it odd that dt.epoch(), a function designed specifically to convert datetimes to Unix time, is slower than truediv(). It also gets progressively worse at coarser precision: the conversion to seconds is the slowest (4.22 ms versus 1.67 ms for truediv()).
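
One caveat when reading the comparison: as far as I understand, dt.epoch() returns integers while truediv() returns floats, so the two are not doing exactly the same work. The divisors in the example (1, 1000, 1000000, 1000000000) are the number of nanoseconds per ns, us, ms, and s. Below is a minimal plain-Python sketch of that arithmetic, independent of Polars; the names `NS_PER_UNIT`, `epoch_int`, and `epoch_float` are hypothetical helpers made up for illustration and are not part of any library.

```python
from datetime import datetime, timezone

# Nanoseconds per target unit, matching the divisors used in the example.
NS_PER_UNIT = {"ns": 1, "us": 1_000, "ms": 1_000_000, "s": 1_000_000_000}


def epoch_int(ns: int, time_unit: str) -> int:
    """Integer epoch value: floor-divide the nanosecond count."""
    return ns // NS_PER_UNIT[time_unit]


def epoch_float(ns: int, time_unit: str) -> float:
    """Float epoch value: true-divide the nanosecond count."""
    return ns / NS_PER_UNIT[time_unit]


# 2023-01-01T00:00:00Z plus one millisecond, in nanoseconds since the Unix epoch.
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
ns = int(start.timestamp()) * 1_000_000_000 + 1_000_000

# Print the integer and float result side by side for each unit.
for unit in NS_PER_UNIT:
    print(unit, epoch_int(ns, unit), epoch_float(ns, unit))
```

The integer variant drops any sub-unit remainder, while the float variant keeps it as a fraction, which is one reason the two columns in the example are not interchangeable even when their timings are compared.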

Expected behavior

That dt.epoch() is at least as fast as truediv().

Installed versions

```
--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Windows-11-10.0.22631-SP0
Python:               3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:28:07) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:          3.0.0
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
hvplot:
matplotlib:           3.8.3
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

MarcoGorelli commented 5 months ago

Thanks @Chuck321123 for the report, will take a look.