Open munro opened 2 years ago
Added the performance label rather than feature, as the output is correct/unchanged. I just checked the Python example on the latest master (July 17, 2022), and the latest assertion still fails despite #3313. Not sure if that PR was meant to fix this completely?
> assert (
called_apply == 0
), f"Expected python_udf never to have been called, instead it was called {called_apply} times"
E AssertionError: Expected python_udf never to have been called, instead it was called 5 times
E assert 5 == 0
I would love it if Polars optimized apply so that it is only called on rows used in the final collected output.
My use case is that I have a slow Python function, and it would be conceptually easier if I could just apply that function to the entire dataframe but only have it run for the rows that survive my filter.
For the time being, I have some bespoke pipeline code that runs the slow function after I filter the dataframe.
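To make the workaround concrete, here is a rough sketch of the filter-first pattern. It is plain Python over a list of dicts standing in for the dataframe, not Polars code, and every name in it (slow_udf, filter_then_apply, the id/x/y columns) is hypothetical and made up for illustration.

```python
calls = 0  # counts how often the slow function actually runs


def slow_udf(x):
    """Hypothetical stand-in for an expensive Python function."""
    global calls
    calls += 1
    return x * 2


def filter_then_apply(rows, keep, udf, column, out):
    """Filter first, then run udf only on the rows that survive."""
    kept = [row for row in rows if keep(row)]
    # Build new dicts so the input rows are left unmodified.
    return [{**row, out: udf(row[column])} for row in kept]


# Five rows, of which only two pass the filter.
rows = [{"id": i, "x": i * 10} for i in range(5)]
result = filter_then_apply(rows, lambda r: r["id"] >= 3, slow_udf, "x", "y")

# The slow function ran once per surviving row, not once per input row.
print(calls, result)
```

The request in this issue is essentially for the lazy optimizer to do this reordering itself, so that the apply can be written before the filter and still only run on the filtered rows.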
There are two optimizations I saw that could happen:
1. Don't run LazyFrame.apply on rows that have been filtered out.
2. If nothing uses the output of LazyFrame.apply, then don't run it at all!
I wrote 2 test cases to show what I mean; hopefully this explains it better.