pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Replacing a value with null across many columns is very slow in a lazyframe #10887

Open mmcdermott opened 11 months ago

mmcdermott commented 11 months ago

I don't have full details or a simple code example yet, but I wanted to report a problem I've observed. I have a fairly small dataframe with a number of "count" columns that are often dominated by zeros. I have many of these dataframes (so in total it is a lot of data), and with fully collected dataframes I've observed that replacing zeros with nulls can significantly speed up reads from and writes to disk. The code that produces these dataframes, however, uses lazy frames. I added this line to my lazy frame, just before it was collected and written to disk, to do the replacement:

df = df.with_columns(pl.when(cs.ends_with('count') != 0).then(cs.ends_with('count')).keep_name())

And suddenly my code slowed down dramatically. In comparison, if I instead collect the dataframe before running that line, the overall code speeds up, because (as noted above) the final writes of the dataframes with nulls are faster. This looks like a problem in how the LazyFrame is collected, so I wanted to highlight it.

You can see the code I'm talking about here: https://github.com/mmcdermott/EventStreamGPT/blob/7e8cdda14bc8cea4674d14d7106c1cc6c8d15e47/EventStream/data/dataset_polars.py#L1766

Unfortunately I haven't constructed a minimal working example of the problem yet, but time permitting I'll try to add one soon. Regardless, I thought I should highlight this issue in case others experience the same thing.

ritchie46 commented 11 months ago

Can you make a minimal example? Then we can do some profiling.

mmcdermott commented 11 months ago

Yes, I will try, but it may take me some time. Apologies for the delay!

daviskirk commented 9 months ago

@mmcdermott did you find a solution here? I'm seeing the exact same thing. Can't come up with a simple example though.

However, I don't think it is inherent to every LazyFrame. Doing a collect and jumping back to the lazy API works OK as well, so my guess is that we're repeating operations from the LazyFrame's operation history or not correctly eliminating sub-expressions.

fast:

lazy_df.collect().with_columns(pl.when(...).then(...))
lazy_df.collect().lazy().with_columns(pl.when(...).then(...))

slow:

lazy_df.with_columns(pl.when(...).then(...))

mmcdermott commented 9 months ago

To be honest I forgot about this issue and don't recall if I ever really solved it or if I just converted this operation back to operating over Data Frames. Sorry!
