Sort merge join - Githubissues

ritchie46 commented 4 years ago

A sort-merge join can be faster than a hash join when the Series are already sorted, and maybe even when they are not.
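For reference, here is a minimal pure-Python sketch of the idea, assuming both inputs are lists of (key, value) pairs already sorted by key. The function name `sort_merge_join` and the sample data are made up for illustration; this is not how Polars implements it.

```python
def sort_merge_join(left, right):
    """Inner join of two lists of (key, value) pairs, both pre-sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key is too small, advance left
        elif lk > rk:
            j += 1  # right key is too small, advance right
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row holding this key with the whole right run.
            while i < len(left) and left[i][0] == lk:
                for _, rv in right[j:j_end]:
                    out.append((lk, left[i][1], rv))
                i += 1
            j = j_end
    return out


# Hypothetical sample data, sorted by key.
left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
joined = sort_merge_join(left, right)
print(joined)
```

Each input is scanned once (plus the output for duplicate keys), and no hash table is built, which is where the potential win on sorted data comes from.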

ninkoze commented 4 years ago

Can I take up this issue and work on it, if you are not already working on it?

ritchie46 commented 4 years ago

Be my guest! :)

ErikDeSmedt commented 3 years ago

Just wanted to highlight that sorted-joins do not require exact matches.

There is a large benefit for time-series analysis here. It is often useful to join two dataframes on non-exact timestamp matches.

A simple example would be to see which person would get on which bus from the two datasets provided below. Here you want to join on timestamp (non-exact) and bus stop (exact) to find out which passenger boarded which bus.

Bus stops

Timestamp	Bus	Stop
14:00	Bus A	Stop 1
14:10	Bus B	Stop 2
14:15	Bus A	Stop 2
14:20	Bus A	Stop 3

Passengers

Timestamp	Passenger	Stop
14:02	John	Stop 3
14:09	Brad	Stop 2
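To make the intended result concrete, here is a rough pure-Python sketch of such a non-exact join on the two tables above. It assumes a "forward" match (each passenger takes the first bus arriving at their stop at or after their own timestamp) and zero-padded HH:MM strings, so that plain string comparison orders the times correctly. The function name `asof_join_forward` is made up for illustration and does not come from any library.

```python
from bisect import bisect_left
from collections import defaultdict


def asof_join_forward(passengers, buses):
    """Match each passenger to the next bus at the same stop (assumed semantics).

    passengers: list of (timestamp, name, stop)
    buses:      list of (timestamp, bus, stop)
    """
    # Group bus arrivals per stop, sorted by timestamp.
    by_stop = defaultdict(list)
    for t, bus, stop in sorted(buses):
        by_stop[stop].append((t, bus))

    result = []
    for t, name, stop in passengers:
        arrivals = by_stop.get(stop, [])
        # (t,) sorts before any (t, bus), so this finds the first arrival >= t.
        idx = bisect_left(arrivals, (t,))
        match = arrivals[idx] if idx < len(arrivals) else None
        result.append((name, stop, t, match))
    return result


buses = [
    ("14:00", "Bus A", "Stop 1"),
    ("14:10", "Bus B", "Stop 2"),
    ("14:15", "Bus A", "Stop 2"),
    ("14:20", "Bus A", "Stop 3"),
]
passengers = [
    ("14:02", "John", "Stop 3"),
    ("14:09", "Brad", "Stop 2"),
]
matches = asof_join_forward(passengers, buses)
for row in matches:
    print(row)
```

Under these assumptions John ends up on Bus A (14:20 at Stop 3) and Brad on Bus B (14:10 at Stop 2). Because the arrivals are sorted, each lookup is a binary search rather than a scan, which is why this kind of join fits naturally next to a sort-merge join.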

ritchie46 commented 3 years ago

I can understand this can be useful, but has this got a name? This isn't exactly a join? Feels like a bucket search or something like that.

ErikDeSmedt commented 3 years ago

Such an operation is often named as_of_join. Existing implementations include pandas and Flint, which provides an implementation on top of Spark.

ritchie46 commented 2 years ago

Out of scope.

ritchie46 commented 2 years ago

Such an operation is often named as_of_join.

Btw. as_of_join is implemented. :)

pola-rs / polars

Sort merge join #22