pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Apply not being optimized when using LazyFrame [Python] #3235

Open munro opened 2 years ago

munro commented 2 years ago

I would love it if Polars optimized apply so that it is only called on the rows used in the final collected output.

My use case: I have a slow Python function. It would be conceptually easier to apply that function to the entire dataframe and have Polars run it only for the rows that survive my filter.

For the time being, I have some bespoke pipeline code that runs the function after I filter the dataframe.
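To make the workaround concrete, here is a minimal sketch in plain Python, deliberately without Polars so it does not depend on any particular version of the library. The names `slow_double`, `apply_then_filter`, and `filter_then_apply` are hypothetical and only illustrate the call-count difference between running the slow function before the filter and after it.

```python
def slow_double(value, counter):
    """Stand-in for a slow Python UDF; counts how often it is called."""
    counter["calls"] += 1
    return value * 2


def apply_then_filter(rows, counter):
    # Order described in the issue: the UDF runs on every row,
    # and the filter is applied afterwards.
    with_f = [(a, slow_double(a, counter)) for a in rows]
    return [row for row in with_f if row[0] < 3]


def filter_then_apply(rows, counter):
    # Manual "predicate pushdown": filter first, then run the UDF
    # only on the rows that remain.
    kept = [a for a in rows if a < 3]
    return [(a, slow_double(a, counter)) for a in kept]


rows = list(range(20))
eager_counter = {"calls": 0}
pushed_counter = {"calls": 0}

eager_result = apply_then_filter(rows, eager_counter)
pushed_result = filter_then_apply(rows, pushed_counter)

print(eager_result == pushed_result)   # same rows either way
print(eager_counter["calls"], pushed_counter["calls"])
```

The reordering is only safe here because the predicate depends on the input column alone, not on the function's output. A filter on the computed column would still need the function to run on every row.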

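There are two optimizations I saw that could happen: skipping apply for rows that are filtered out, and skipping it entirely when its output column is never collected.

I wrote 2 test cases to show what I mean; hopefully they explain it better.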

import polars as pl

def test_apply_filter_optimization():
    """Apply shouldn't be called on filtered-out rows, thanks to predicate pushdown"""
    called_apply = 0

    df = pl.DataFrame({"a": list(range(20))})

    def some_slow_function(value):
        nonlocal called_apply
        called_apply += 1
        return value * 2

    df2 = (
        df.lazy()
        .select([pl.col("a"), pl.col("a").apply(some_slow_function).alias("f_a")])
        .filter(pl.col("a") < 3)
        .collect()
    )

    assert df2.to_numpy().tolist() == [[0, 0], [1, 2], [2, 4]]

    # AssertionError: Expected python_udf to be called 3 times, not 20
    assert (
        called_apply == 3
    ), f"Expected python_udf to be called 3 times, not {called_apply}"

def test_apply_drop_optimization():
    """Apply shouldn't be called if its output column was never collected"""
    called_apply = 0

    df = pl.DataFrame({"a": list(range(5))})

    def some_slow_function(value):
        nonlocal called_apply
        called_apply += 1
        return value * 2

    df2 = (
        df.lazy()
        .select([pl.col("a"), pl.col("a").apply(some_slow_function).alias("f_a")])
        .select([pl.col("a")])
        .collect()
    )

    assert df2.to_numpy().tolist() == [[0], [1], [2], [3], [4]]

    # AssertionError: Expected python_udf never to have been called, instead it was called 5 times
    assert (
        called_apply == 0
    ), f"Expected python_udf never to have been called, instead it was called {called_apply} times"
zundertj commented 2 years ago

Added the performance label rather than feature, as the output is correct/unchanged. I just checked the Python example on latest master (July 17, 2022), and the last assertion still fails despite #3313. Not sure if that PR was meant to fix this completely?

>       assert (
            called_apply == 0
        ), f"Expected python_udf never to have been called, instead it was called {called_apply} times"
E       AssertionError: Expected python_udf never to have been called, instead it was called 5 times
E       assert 5 == 0
ritchie46 commented 2 years ago

#3313 only fixed the first function. I still need to do the last one.