pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Don't guarantee left join ordering. #19576

Open ritchie46 opened 3 weeks ago

ritchie46 commented 3 weeks ago

Description

The row order of a left join shouldn't have been guaranteed; it should have been left as an implementation detail.

https://github.com/pola-rs/polars/blob/c3c38a9ddc13d7b0b0d1c413f5183c1ee8b06709/py-polars/polars/lazyframe/frame.py#L4443


s-banach commented 3 weeks ago

Hoo boy, this one is going to break some code.

orlp commented 3 weeks ago

There should be a preserve_order attribute added, defaulting to None, which can be set to "left" or "right".

@s-banach Unless we break this promise, the streaming join will be slow by default, because you can't do a partitioned join if you must preserve order. At the very least, it would require a slow recombining and re-sorting step afterwards.
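
To illustrate the point about partitioned joins, here is a minimal pure-Python sketch. It is not Polars' streaming engine: the function name `partitioned_left_join`, the tuple layout, and the two-partition setup are all assumptions made for the example. Rows are split by key hash, each partition is joined on its own, and the results are concatenated, so the output follows partition order rather than left-table order. Restoring the left order needs the extra sort at the end.

```python
from collections import defaultdict


def partitioned_left_join(left, right, n_partitions=2):
    """Toy partitioned left join over lists of (key, value) tuples.

    Hypothetical helper for illustration only. Each output row carries
    the original index of its left row so the order can be restored.
    """
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)

    # Route every row to a partition based on the hash of its key.
    for idx, (key, val) in enumerate(left):
        left_parts[hash(key) % n_partitions].append((idx, key, val))
    for key, val in right:
        right_parts[hash(key) % n_partitions].append((key, val))

    out = []
    # A real engine would process partitions in parallel; here they run
    # one after another and their results are simply concatenated.
    for p in range(n_partitions):
        table = defaultdict(list)
        for key, val in right_parts[p]:
            table[key].append(val)
        for idx, key, lval in left_parts[p]:
            # Left rows without a match are kept with a None right value.
            for rval in table.get(key) or [None]:
                out.append((idx, key, lval, rval))
    return out


left = [(3, "a"), (0, "b"), (1, "c"), (2, "d")]
right = [(0, "x"), (1, "y"), (3, "z")]

joined = partitioned_left_join(left, right)
left_order = [idx for idx, *_ in joined]

# The extra step needed to get back the original left order.
restored = sorted(joined, key=lambda row: row[0])
```

With small integer keys, `left_order` is not `[0, 1, 2, 3]`, because rows come out grouped by partition; `restored` is the same result after the additional sort.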

And if order is preserved, we also can't switch which side of the join is the build side and which is the probe side in streaming. That's something we'd like to be able to do in the future, as you'd much rather have a small build side.
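
The build/probe trade-off can be sketched the same way. This is again a toy pure-Python hash join, not Polars' implementation; `hash_join_left` and its `build_side` argument are hypothetical names. A hash join emits rows in probe order, so building on the left side (for example because it is the smaller table) produces the same rows in an order driven by the right table.

```python
from collections import defaultdict


def hash_join_left(left, right, build_side="right"):
    """Toy left join of (key, value) lists, emitting (key, left_val, right_val).

    Hypothetical helper for illustration only.
    """
    if build_side == "right":
        # Build the hash table on the right, probe with the left:
        # output rows follow the order of the left table.
        table = defaultdict(list)
        for key, rval in right:
            table[key].append(rval)
        out = []
        for key, lval in left:
            for rval in table.get(key) or [None]:
                out.append((key, lval, rval))
        return out

    # Build the hash table on the left, probe with the right:
    # matched rows follow the order of the right table.
    table = defaultdict(list)
    for i, (key, lval) in enumerate(left):
        table[key].append((i, lval))
    matched = set()
    out = []
    for key, rval in right:
        for i, lval in table.get(key, ()):
            matched.add(i)
            out.append((key, lval, rval))
    # Left rows that never matched are appended at the end.
    for i, (key, lval) in enumerate(left):
        if i not in matched:
            out.append((key, lval, None))
    return out


left = [("b", 1), ("a", 2), ("c", 3)]
right = [("a", 10), ("b", 20)]

probe_left = hash_join_left(left, right, build_side="right")
probe_right = hash_join_left(left, right, build_side="left")
```

Both calls return the same set of rows, but only `probe_left` comes out in left-table order. Guaranteeing that order therefore rules out picking the build side freely.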

orlp commented 2 weeks ago

I think we can already add this preserve_order parameter and implement it before 2.0 hits.

ritchie46 commented 2 weeks ago

Yes, but it should be called maintain_order then. We already use that name.