twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

Confusing API for setting the number of reducers for a sketch join #1152

Open joshualande opened 9 years ago

joshualande commented 9 years ago

Please forgive my ignorance if I say something that is misleading or wrong.

I think the syntax for specifying the number of reducers during a sketch join is confusing.

To join two tables in scalding as specify the number of reducers for the MR job, the syntax is:

val pipe1 = typedPipe1.groupBy { ... }
val pipe2 = typedPipe2.groupBy { ... }
val joined = pipe1
  .join(pipe2)
  .withReducers(100)

As far as I can tell, if you want to do the same thing, but using a sketch join, the syntax is:

val joined = pipe1
  .sketch(100)
  .join(pipe2)

I think this is confusing because it seems to imply that 100 reducers are used for the count-min sketch (when apparently only 1 is actually used), and also you can't specify withReducers following the join like you can in the other case.

To me, a simpler syntax would be

val joined = pipe1
  .sketch
  .join(pipe2)
  .withReducers(100)

I hope that what I said is true. Tell me if you think this makes sense.

Best,

Joshua

joshualande commented 9 years ago

@johnynek, tell me if this makes sense!

johnynek commented 9 years ago

It does to me. @avibryant what do you think of this?

avibryant commented 9 years ago

Everything @joshualande says there is true, and I think the proposed change is on balance a good one. I remember thinking the same thing at this point and am not sure exactly why I decided against it. FWIW, I think the main argument for keeping it the way it is, is that the number of reducers is a pretty critical parameter to the sketch join (arguably much more so than in a regular join) because it affects the replication factors used and so the total volume of data shuffled, rather than just how that data is partitioned. So a) you arguably want to force people to explicitly choose a value here (otherwise we're going to have to provide an arbitary default) and b) you might want to keep it with the other params like delta and eps. But overall, I think I agree that consistency is better.