pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Optimize for simple math? #18321

Open abstractqqq opened 3 weeks ago

abstractqqq commented 3 weeks ago

Reproducible example

This is not urgent at all. Just a curious observation.

Not sure if this is feasible, to be honest, because we don't technically know the types of the variables.

But when we do know the type, or when we are adding a literal to a column, the optimizer should notice that (pl.col("x1") + 0.) is the same as just pl.col("x1").

%timeit df.lazy().select(pl.col("x1")).collect()
%timeit df.lazy().select(pl.col("x1") + 0.).collect()

12 µs ± 37.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
25.2 µs ± 238 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The situation occurs when 0 is the default value in some more complicated expression. Say, in a linear regression with an intercept (bias term), we compute y = b0 + x1 * b1 + x2 * b2 + x3 * b3. Typically we set b0 = 0 if we don't wish to fit the intercept. We can always check whether b0 is 0, but that is more code.

I am hoping that the optimizer can optimize it away when we are sure that we are adding a literal 0 to some numerical column.

The same goes for a literal 1 * some numerical column.

Log output

No response

Issue description

See above

Expected behavior

The optimizer should recognize the following cases as no-ops:

literal 0 + number column
literal 1 * number column

The question is whether we know the column is of numerical type at the time when the optimizer looks at the expression. My suspicion is that we don't...

Installed versions

--------Version info---------
Polars: 1.4.0
Index type: UInt32
Platform: Linux-6.10.2-arch1-1-x86_64-with-glibc2.40
Python: 3.12.4 (main, Jun 7 2024, 06:33:07) [GCC 14.1.1 20240522]

Julian-J-S commented 3 weeks ago

Not that easy imo.

Integer + Float -> Float

This should be the case no matter what the Float value is. We should not have a special case where Integer + Float<0.0> -> Integer

Integer + 0.0 is therefore equal to "Cast Integer to Float", which is not a no-op afaik because the underlying bits of the number representation change 🤔
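The point about the bits changing can be illustrated with the standard library alone. The sketch below packs the value 1 as a 64-bit integer and as a 64-bit float (little-endian) and compares the bytes; it says nothing about how Polars stores or casts columns internally, only that the two encodings differ.

```python
import struct

# The same numeric value, encoded as Int64 and as Float64 (little-endian).
int_bytes = struct.pack("<q", 1)
float_bytes = struct.pack("<d", 1.0)

print(int_bytes.hex())    # 0100000000000000
print(float_bytes.hex())  # 000000000000f03f

# The values compare equal, but the byte patterns do not, so turning an
# integer buffer into a float buffer has to rewrite every element.
same_value = (1 == 1.0)
same_bits = (int_bytes == float_bytes)
```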

barak1412 commented 3 weeks ago

We have a rule for literal(a) + literal(b) -> literal(a + b).

I guess it should not be that hard, but maybe I'm missing something. I will be glad to look at it after I finish my current pull request (only tests and CR are left).
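To make the discussion concrete, here is a toy expression simplifier in plain Python. It is not Polars' optimizer and shares no code with it; `Col`, `Lit`, `Add`, `dtype_of`, and `simplify` are all invented for this sketch. It folds literal + literal and removes `x + 0` only when the schema says the result dtype would not change, which is the condition raised earlier in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: object  # a Python int or float


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def dtype_of(expr, schema):
    """Result dtype ("int" or "float") under the rule int + float -> float."""
    if isinstance(expr, Col):
        return schema[expr.name]
    if isinstance(expr, Lit):
        return "float" if isinstance(expr.value, float) else "int"
    sides = (dtype_of(expr.left, schema), dtype_of(expr.right, schema))
    return "float" if "float" in sides else "int"


def simplify(expr, schema):
    if not isinstance(expr, Add):
        return expr
    left = simplify(expr.left, schema)
    right = simplify(expr.right, schema)
    # Constant folding: lit(a) + lit(b) -> lit(a + b)
    if isinstance(left, Lit) and isinstance(right, Lit):
        return Lit(left.value + right.value)
    # Identity: x + 0 -> x, but only if dropping the literal keeps the dtype.
    # This toy ignores float corner cases such as -0.0 + 0.0.
    for keep, other in ((left, right), (right, left)):
        if isinstance(other, Lit) and other.value == 0:
            if dtype_of(keep, schema) == dtype_of(Add(left, right), schema):
                return keep
    return Add(left, right)


schema = {"i": "int", "f": "float"}
float_case = simplify(Add(Col("f"), Lit(0.0)), schema)  # Col("f")
int_case = simplify(Add(Col("i"), Lit(0.0)), schema)    # unchanged: result is float
```

The rule needs the schema to decide anything, so in this sketch it can only fire once column dtypes are known.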

connor-elliott commented 3 weeks ago

https://github.com/pola-rs/polars/issues/139

ritchie46 commented 2 weeks ago

Not that easy imo.

Integer + Float -> Float

This. It isn't a no-op. It is a cast. These timings should not be the same if you multiply an integer with a float.

@abstractqqq can you share your complete example?

abstractqqq commented 2 weeks ago

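import numpy as np
import polars as pl

# "a" is Float64, so adding 0. involves no integer-to-float cast
df = pl.DataFrame({
    "a": np.random.random(size=1000)
})

# %timeit is an IPython/Jupyter magic; run these lines there
%timeit df.select(pl.col("a"))
%timeit df.select(pl.col("a") + 0.)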
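One caveat for the float-column case, independent of Polars: under IEEE 754 arithmetic, `x + 0.0` is not a bit-for-bit identity, because negative zero plus positive zero yields positive zero. Multiplying by `1.0` does preserve the sign of zero. The standard-library sketch below shows this on plain Python floats; whether an optimizer should care about this corner case is a design decision this thread has not settled.

```python
import math

neg_zero = -0.0

added = neg_zero + 0.0       # becomes +0.0
multiplied = neg_zero * 1.0  # stays -0.0

# copysign exposes the sign bit, which == comparison hides (-0.0 == 0.0).
print(math.copysign(1.0, neg_zero))    # -1.0
print(math.copysign(1.0, added))       # 1.0
print(math.copysign(1.0, multiplied))  # -1.0
```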