pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Utilize sorted flags for `.filter` #9127

Open endremborza opened 1 year ago

endremborza commented 1 year ago

Problem description

Hello.

I'm new to polars but so far I like it quite a bit. I'm also open to contributing a solution to this problem if I choose to stick with polars, but first I thought I'd just lay it out simply:

import polars as pl

pldf = pl.DataFrame({"A": iter(range(10_000_000))})

i = 4_000_000

pldf.filter(pl.col("A") > i) # 17.6 ms ± 526 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

pldf.set_sorted("A").filter(pl.col("A") > i) # 13.7 ms ± 92.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

pldf.slice(pldf["A"].search_sorted(i)) # 55.8 µs ± 3.06 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

It seems that there is some optimization going on, but clearly not close to its full potential: I don't see a clear reason why option two should not perform the same computation as option three.

I'm also new to Rust, so doing this would require significant effort on my part, but if someone points me to the relevant parts of the code, I could give it a look.
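The speedup in option three comes from binary search: on data sorted ascending, a `> i` predicate selects a contiguous suffix of rows, so its start can be found in O(log n) rather than evaluating the predicate for every row. A minimal pure-Python sketch of the idea using the stdlib `bisect` module (the function name here is illustrative, not a polars internal):

```python
import bisect

def filter_gt_sorted(values, threshold):
    """Select values > threshold from an ascending-sorted list.

    bisect_right returns the first index whose element is strictly
    greater than `threshold`, so the result is a single slice after
    an O(log n) search -- no per-row boolean mask is materialized.
    """
    start = bisect.bisect_right(values, threshold)
    return values[start:]

data = list(range(10))            # sorted input
print(filter_gt_sorted(data, 4))  # → [5, 6, 7, 8, 9]
```

This is exactly what the `slice(search_sorted(...))` variant above does by hand; the feature request is for `.filter` to take this path automatically when the sorted flag is set.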

alexander-beedie commented 1 year ago

FYI (unrelated to the sorted flag observation, but useful knowledge): you can initialize the example frame 10x faster by dropping the leading iter, as there is a fast path for range ;)

pl.DataFrame({"A":range(10_000_000)})
endremborza commented 1 year ago

I know, but that adds the sorted flag automatically, and I wanted to do a comparison without it.

endremborza commented 1 year ago

Another thing I wanted to make a quick note of: a seemingly more complicated but still very useful optimization would apply to multi-level joins/sorts:

df = pl.DataFrame({"A": [1,2] * 500_000, "B": range(1_000_000)}).sample(999_000)
dfs = df.sort("A", "B")

df2 = pl.DataFrame({"A": [1, 2, 3, 4] * 50, "B": range(200)})

df.join(df2, on=["A", "B"]) # 746 µs ± 23.5 µs per loop
dfs.join(df2, on=["A", "B"]) # 762 µs ± 18.1 µs per loop
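The multi-key case reduces to the same idea, because rows sorted by ("A", "B") compare lexicographically as tuples: the run of rows matching a given key is contiguous and can be bounded by two binary searches. A hedged pure-Python sketch of that probe (illustrative only, not how polars implements joins):

```python
import bisect

def range_for_key(rows, key):
    """Find the contiguous run of rows equal to a (A, B) key.

    `rows` must be sorted lexicographically, as after sort("A", "B").
    bisect_left/bisect_right bound the run of equal keys, so a join
    against sorted data could probe each right-hand key in O(log n)
    instead of hashing or scanning the whole left side.
    """
    lo = bisect.bisect_left(rows, key)
    hi = bisect.bisect_right(rows, key)
    return rows[lo:hi]

rows = [(1, 10), (1, 20), (2, 10), (2, 10), (2, 30)]  # sorted
print(range_for_key(rows, (2, 10)))  # → [(2, 10), (2, 10)]
```

The benchmark above shows the sorted frame `dfs` joining no faster than the unsorted `df`, which suggests the join currently does not exploit multi-column sortedness this way.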