twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.49k stars 703 forks source link

Add MultiJoinFunction to remove more anonymous functions #1827

Closed johnynek closed 6 years ago

johnynek commented 6 years ago

This removes the use of anonymous functions for joining and replaces with a sealed class. This is a worthwhile improvement in any case but it may also fix #1795 since we use externalizer at each element in the ADT.

This hints at an approach we could use elsewhere: in the Composed functions, we could use Externalizer, so that each function can be serialized separately.

We also fix a few issues:

  1. we were incorrectly closing over Pair which transitively would pull in the entire graph reachable from the join.
  2. our tracking of 1-1-ness or many -> 1-ness was incomplete. This improves it, so that we potentially improve the performance of our joins since we can see that the right-side does not need to be recomputed for each value on the left without blowing up memory.

This may be the problem with #1795 but since we don't yet have a repro test, we can't be sure.

Please take a look @fwbrasil @ianoc

While this may not fix the issue, it is certainly a bug, so we should merge it in my view.

johnynek commented 6 years ago

Well, it didn't fix it. We tripped the same error. It is still worth merging, but not the problem.

It's a puzzle how MapGroup is involved here.

johnynek commented 6 years ago

maybe we got lucky, but this went green on the first try (which has been pretty rare).

johnynek commented 6 years ago

If we got lucky, we got lucky twice... Not that common. So, this seems positive to me.

johnynek commented 6 years ago

Three times green. I think this is a fix.

ianoc commented 6 years ago

lgtm

(With the comment i was really referring really how the layered multi-joins worked not the require itself, but all good)