pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Sort merge join #22

Closed ritchie46 closed 2 years ago

ritchie46 commented 4 years ago

A sort-merge join can be faster than a hash join when the Series are already sorted, and possibly even when they are not.
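
For readers unfamiliar with the technique, here is a minimal pure-Python sketch of an inner sort-merge join over two inputs that are assumed to be already sorted by key. The function name `sort_merge_join` and the `(key, value)` row layout are illustrative assumptions, not Polars APIs, and this is not how Polars implements it.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) rows, both pre-sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left key is behind: advance the left cursor
        elif lk > rk:
            j += 1          # right key is behind: advance the right cursor
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][0] == lk:
                for k in range(j, j_end):
                    out.append((lk, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
result = sort_merge_join(left, right)
```

Each cursor only moves forward, so with pre-sorted inputs the work is linear in the input sizes plus the number of output rows, and no hash table has to be built. If the inputs are not sorted, the cost of sorting them first has to be paid as well.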

ninkoze commented 4 years ago

Can I take up this issue and work on it, if you are not already working on it?

ritchie46 commented 4 years ago

Be my guest! :)

ErikDeSmedt commented 3 years ago

Just wanted to highlight that sorted joins do not require exact matches.

There is a large benefit for time-series analysis here. It is often useful to join two dataframes on (non-exact) timestamp matches.

A simple example is working out which person gets on which bus from the two datasets below. Here you want to join on timestamp (non-exact) and bus stop to find out which passenger boarded which bus.

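Bus stops

| Timestamp | Bus   | Stop   |
|-----------|-------|--------|
| 14:00     | Bus A | Stop 1 |
| 14:10     | Bus B | Stop 2 |
| 14:15     | Bus A | Stop 2 |
| 14:20     | Bus A | Stop 3 |

Passenger

| Timestamp | Passenger | Stop   |
|-----------|-----------|--------|
| 14:02     | John      | Stop 3 |
| 14:09     | Brad      | Stop 2 |
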
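
The bus example can be sketched in plain Python as follows. This is an illustration only: it assumes a "forward" match (each passenger takes the first bus that reaches their stop at or after the time they arrive), and the names `to_minutes` and `asof_join_forward` are hypothetical helpers, not part of Polars or pandas.

```python
from bisect import bisect_left
from collections import defaultdict


def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def asof_join_forward(passengers, buses):
    """For each passenger, find the first bus at the same stop at or after their arrival."""
    # Group bus arrivals by stop, sorted by time within each stop.
    by_stop = defaultdict(list)
    for time, bus, stop in sorted(buses, key=lambda row: to_minutes(row[0])):
        by_stop[stop].append((to_minutes(time), time, bus))

    out = []
    for time, name, stop in passengers:
        arrivals = by_stop.get(stop, [])
        times = [arrival[0] for arrival in arrivals]
        # Binary search for the first bus time >= the passenger's time.
        idx = bisect_left(times, to_minutes(time))
        if idx < len(arrivals):
            _, bus_time, bus = arrivals[idx]
        else:
            bus_time, bus = None, None  # no later bus at this stop
        out.append((name, stop, time, bus, bus_time))
    return out


buses = [
    ("14:00", "Bus A", "Stop 1"),
    ("14:10", "Bus B", "Stop 2"),
    ("14:15", "Bus A", "Stop 2"),
    ("14:20", "Bus A", "Stop 3"),
]
passengers = [
    ("14:02", "John", "Stop 3"),
    ("14:09", "Brad", "Stop 2"),
]
matches = asof_join_forward(passengers, buses)
```

Under the forward-match assumption, John is paired with Bus A at 14:20 and Brad with Bus B at 14:10. The sketch uses a binary search per passenger for brevity; when both inputs are sorted by time, the same result can be produced with two forward-moving cursors, which is what ties this operation to sorted joins.
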
ritchie46 commented 3 years ago

I can see that this is useful, but does it have a name? It isn't exactly a join; it feels like a bucket search or something like that.

ErikDeSmedt commented 3 years ago

Such an operation is often named as_of_join. Existing implementations include pandas (`merge_asof`) and Flint, which provides an implementation on top of Spark.

ritchie46 commented 2 years ago

Out of scope.

ritchie46 commented 2 years ago

> Such an operation is often named as_of_join.

Btw. as_of_join is implemented. :)