pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Feature request: Faster backward- and forward_fill() functions #16875

Open Chuck321123 opened 2 months ago

Chuck321123 commented 2 months ago

Description

This may not be the highest priority right now, but I would be happy if we got faster backward_fill() and forward_fill() functions, as I think there is more optimization potential in them. Running this code:

import pandas as pd
import numpy as np
import polars as pl

np.random.seed(123)

n_rows = 100_000_000

random_numbers = np.random.rand(n_rows)
nan_mask = np.random.rand(n_rows) < 0.5
random_numbers[nan_mask] = np.nan

# Create DataFrame
df = pd.DataFrame({
    'RandomNumbers': random_numbers
})

print(df.head(10))

df = pl.DataFrame(df)

df = df.with_columns(pl.col("RandomNumbers").fill_nan(None).alias("Results"))

%timeit df.with_columns(pl.col("RandomNumbers").backward_fill().alias("Results"))

I get this benchmark: 1.17 s ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each).

Obviously, it becomes even slower if you forward and backward fill over groups. It would be nice if someone could find a way to improve these functions.

cmdlineluser commented 2 months ago

Just for reference: https://github.com/pola-rs/polars/issues/15480#issuecomment-2129888688

More improvement with branchless filling is possible still but low priority at the moment, as it's rather labour-intensive to write.
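To illustrate the general idea of filling without a per-element `if`, here is a rough NumPy sketch. It is not how Polars implements this internally (that code is in Rust and works on validity bitmaps); the helper names `forward_fill_np` and `backward_fill_np` are made up for this example, and NaN stands in for null.

```python
import numpy as np


def forward_fill_np(values: np.ndarray) -> np.ndarray:
    """Forward-fill NaNs using a running maximum over valid indices."""
    valid = ~np.isnan(values)
    # Each valid slot holds its own index; each missing slot holds 0.
    idx = np.where(valid, np.arange(values.size), 0)
    # The running maximum turns every slot into "index of the last valid value so far".
    np.maximum.accumulate(idx, out=idx)
    # Gather; leading NaNs map to index 0, which is itself NaN, so they stay NaN.
    return values[idx]


def backward_fill_np(values: np.ndarray) -> np.ndarray:
    """Backward fill is a forward fill over the reversed array."""
    return forward_fill_np(values[::-1])[::-1]


x = np.array([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])
print(forward_fill_np(x))   # [nan  1.  1.  1.  4.  4.]
print(backward_fill_np(x))  # [ 1.  1.  4.  4.  4. nan]
```

The select, running maximum and gather are all whole-array operations with no data-dependent branch in the inner loop, which is the property the "branchless" remark refers to.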

deanm0000 commented 2 months ago

did you mean for this to be a future request instead of a feature request?

Chuck321123 commented 2 months ago

@deanm0000 My bad for the misspelling. In reality, it's an optimization request.