twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

Support rightJoin in Sketched #1525

Open reconditesea opened 8 years ago

reconditesea commented 8 years ago

Is there a particular reason that rightJoin is not supported in Sketched and there is no hashRight2 in Joiner class?

johnynek commented 8 years ago

Yes, there is: I didn't see how it would be possible, and no one has corrected me if I'm wrong.

Replicating joins in MapReduce, like the hash join and the sketched join, operate key by key, so a left join is pretty clear. How would you do a right join?

I guess you could keep a bit on each key in the hash table recording whether the key has been seen. Then at the end, for every key whose bit is not set, you could emit a None on the left side. To be honest, I never saw that solution until just now. It might require a specialized Cascading mapper operation.
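
A minimal sketch of the "seen bit" idea for a single mapper, written with plain Scala collections rather than any Scalding or Cascading API. The names `SeenBitRightJoin` and `mapperRightJoin` are hypothetical, and the sketch assumes the whole right side fits in memory on the mapper:

```scala
import scala.collection.mutable

object SeenBitRightJoin {
  // One simulated mapper: the right side is fully replicated into a hash
  // table, while only a slice of the left side streams through.
  def mapperRightJoin[K, L, R](
      leftSlice: Iterator[(K, L)],
      right: Seq[(K, R)]
  ): List[(K, (Option[L], R))] = {
    val table: Map[K, Seq[R]] =
      right.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    // The "bit" per key: which right-side keys were hit by some left row.
    val seen = mutable.Set.empty[K]
    val out = List.newBuilder[(K, (Option[L], R))]

    leftSlice.foreach { case (k, l) =>
      table.get(k).foreach { rs =>
        seen += k
        rs.foreach(r => out += ((k, (Some(l), r))))
      }
    }
    // End of input: every key whose bit was never set gets a None on the left.
    for {
      (k, rs) <- table
      if !seen(k)
      r <- rs
    } out += ((k, (None, r)))

    out.result()
  }
}
```

This is only correct when one mapper sees the entire left side; with several mappers, each one emits its own None rows independently.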

Good question.

P. Oscar Boykin, Ph.D. | http://twitter.com/posco | http://pobox.com/~boykin

dvryaboy commented 8 years ago

@johnynek you would need a deduping operation of some sort on the reduce side, since a right-hand key that one mapper never matches might be seen on the left side by another mapper, and (null, rkey) would not be a valid result row in that case.
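
To make the problem concrete, here is a small simulation in plain Scala collections (not Scalding API; `DedupRightJoin`, `mapper`, and `reducer` are hypothetical names). It assumes each right row carries a stable id, here its index, so the reduce side can tell duplicate rows apart:

```scala
object DedupRightJoin {
  // Output row: (right-row id, key, matched left value if any, right value).
  // Each mapper joins its own left slice against the full right side and
  // emits a None row for every right row it did not match locally.
  def mapper[K, L, R](
      leftSlice: Seq[(K, L)],
      right: IndexedSeq[(K, R)]
  ): Seq[(Int, K, Option[L], R)] = {
    val idsByKey: Map[K, Seq[Int]] = right.indices.groupBy(i => right(i)._1)
    val matched = for {
      (k, l) <- leftSlice
      i <- idsByKey.getOrElse(k, Nil)
    } yield (i, k, Some(l): Option[L], right(i)._2)
    val seenIds = matched.map(_._1).toSet
    val unmatched = right.indices.filterNot(seenIds).map { i =>
      (i, right(i)._1, Option.empty[L], right(i)._2)
    }
    matched ++ unmatched
  }

  // Reduce side: group by right-row id. If any mapper matched the row, keep
  // only the matched rows; otherwise keep exactly one None row.
  def reducer[K, L, R](rows: Seq[(Int, K, Option[L], R)]): Seq[(K, (Option[L], R))] =
    rows.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (_, group) =>
      val hits = group.filter(_._3.isDefined)
      val keep = if (hits.nonEmpty) hits else group.take(1)
      keep.map { case (_, k, l, r) => (k, (l, r)) }
    }
}
```

With right rows `a` and `b`, one mapper holding left key `a` and another holding only left key `c`, the second mapper emits a None row for `a` even though `a` does have a match elsewhere. The reducer drops that row and collapses the two None rows for `b` into one.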

johnynek commented 8 years ago

yep. That's what I forgot.

reconditesea commented 8 years ago

I didn't think we would need such support until recently, when I hit a use case where the left side of a rightJoin has extreme key skew. But yeah, I can see why rightJoin doesn't make sense in Sketched. One cannot distribute a None key across the reducers.

As a workaround, I can split this join into two sequential map-reduce steps.
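
The comment does not say what the two steps are. One possible reading, sketched here with plain Scala collections and hypothetical names (`TwoStepRightJoin`, `rightJoin`), is an inner join that can tolerate the skew, followed by an anti-join against the distinct left keys:

```scala
object TwoStepRightJoin {
  def rightJoin[K, L, R](
      left: Seq[(K, L)],
      right: Seq[(K, R)]
  ): Seq[(K, (Option[L], R))] = {
    val rightByKey = right.groupBy(_._1)

    // Step 1: inner join. In a real job this is where a skew-tolerant join
    // would run, since no None rows have to be produced here.
    val matched = for {
      (k, l) <- left
      (_, r) <- rightByKey.getOrElse(k, Nil)
    } yield (k, (Some(l): Option[L], r))

    // Step 2: anti-join. Taking distinct left keys removes the skew, and
    // every right row whose key is absent gets a None on the left.
    val leftKeys = left.map(_._1).toSet
    val unmatched = right.collect {
      case (k, r) if !leftKeys(k) => (k, (Option.empty[L], r))
    }

    matched ++ unmatched
  }
}
```

The union of the two steps has right-join semantics, at the cost of reading both inputs twice.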

dvryaboy commented 8 years ago

Sorted order can be helpful in those cases. If the LHS is sorted, you can, for example, only emit (null, rkey) once you pass the spot where the key would need to be on the LHS. Obviously, one would first have to build key ordering as a concept and then take advantage of it in a join implementation, so it's a big task.
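
A rough illustration of the sorted-order idea on a single stream, in plain Scala collections with hypothetical names (`SortedRightJoin`, `rightJoin`). It assumes both sides are already sorted by key under the given `Ordering`, and it does not address how sorted ranges would be assigned to mappers:

```scala
object SortedRightJoin {
  def rightJoin[K, L, R](
      left: Seq[(K, L)],
      right: Seq[(K, R)]
  )(implicit ord: Ordering[K]): List[(K, (Option[L], R))] = {
    val out = List.newBuilder[(K, (Option[L], R))]
    var rest = left // remaining left rows, still in sorted order

    right.foreach { case (k, r) =>
      // Left rows sorting before k cannot match this or any later right key.
      rest = rest.dropWhile { case (lk, _) => ord.lt(lk, k) }
      // Matching rows stay in `rest` so a repeated right key can reuse them.
      val hits = rest.takeWhile { case (lk, _) => ord.equiv(lk, k) }
      if (hits.isEmpty) {
        // We are at or past the spot where k would have to be on the left,
        // so it is safe to emit the unmatched row now.
        out += ((k, (None, r)))
      } else {
        hits.foreach { case (_, l) => out += ((k, (Some(l), r))) }
      }
    }
    out.result()
  }
}
```

Because the left cursor only moves forward, the None row for a key can be emitted as soon as the cursor passes that key, without waiting for the end of the input.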
