twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

First draft of pure-scalding memory backend #1697

Closed: johnynek closed this 7 years ago

johnynek commented 7 years ago

This follows up on the thread of work leading to #1682.

This adds an in-memory backend that does not use cascading at all, which makes the basic tests MUCH faster.

This is not a production quality backend yet:

  1. joins are only partially supported (hashJoin works, cogroup does not).
  2. parallelism has not been carefully tuned, so we only get very naive parallelism at the moment.
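
To illustrate why hashJoin is the easier of the two to support, here is a minimal sketch of a hash join over plain Scala collections. It is not the code in this PR: `HashJoinSketch` and its signature are hypothetical, and inner-join semantics are an assumption made for the example. One side is assumed small enough to index in memory, and the other side is streamed past that index. A cogroup is more general, since it has to group every input by key before combining them.

```scala
// Hypothetical sketch using only the Scala standard library.
// None of these names come from scalding or from this PR.
object HashJoinSketch {
  // Inner hash join: `right` is assumed small enough to hold in memory,
  // while `left` is streamed one element at a time.
  def hashJoin[K, V, W](
      left: Iterator[(K, V)],
      right: Iterable[(K, W)]): Iterator[(K, (V, W))] = {
    // Build phase: index the small side by key, keeping every value per key.
    val index: Map[K, List[W]] =
      right.groupBy(_._1).map { case (k, kws) => (k, kws.map(_._2).toList) }
    // Probe phase: look up each left key; keys with no match are dropped.
    left.flatMap { case (k, v) =>
      index.getOrElse(k, Nil).iterator.map(w => (k, (v, w)))
    }
  }
}
```

For example, joining `Iterator((1, "ann"), (2, "bob"), (3, "cy"))` against `List((1, "US"), (2, "FR"))` yields `(1, ("ann", "US"))` and `(2, ("bob", "FR"))`; key 3 has no match and is dropped.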

The main point of this is to exercise the execution API without cascading in the loop. I think this proof of concept shows that a spark backend would not be very hard at this point, and the memory backend should serve as a guide for someone looking to write one.
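
The general idea can be sketched roughly as follows: the pipeline is plain data, and a backend is just an interpreter for that data. This is a toy model under stated assumptions, not scalding's actual types: `Pipe`, `Source`, `Mapped`, `Filtered`, `Backend`, and `MemoryBackend` are all hypothetical names invented for this example.

```scala
// Hypothetical toy model; these are NOT scalding's real classes.
object MemoryBackendSketch {
  // A pipeline is a data structure that describes what to compute.
  sealed trait Pipe[A]
  final case class Source[A](items: Seq[A]) extends Pipe[A]
  final case class Mapped[A, B](input: Pipe[A], fn: A => B) extends Pipe[B]
  final case class Filtered[A](input: Pipe[A], pred: A => Boolean) extends Pipe[A]

  // A backend gives that description a meaning. Another backend could
  // translate the same tree into jobs for a different engine.
  trait Backend {
    def run[A](pipe: Pipe[A]): Seq[A]
  }

  // The in-memory backend interprets the tree with Scala collections.
  object MemoryBackend extends Backend {
    def run[A](pipe: Pipe[A]): Seq[A] = pipe match {
      case Source(items) => items
      // The type variable `a` names the unknown input element type.
      case m: Mapped[a, A] => run(m.input).map(m.fn)
      case Filtered(input, pred) => run(input).filter(pred)
    }
  }
}
```

With this shape, supporting a new engine means writing another `Backend`; the code that builds the `Pipe` value does not change.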

I think we should merge this despite it not being complete because the PR is already dense enough. I'd like to improve the quality of the test coverage and support all the cases in later PRs.

r? @fwbrasil @piyushnarang

johnynek commented 7 years ago

cc @ianoc

johnynek commented 7 years ago

Will send an update addressing these comments. Thank you for taking the time to look.

johnynek commented 7 years ago

okay.

Sorry for the delay. Can you all take another look? As you know, this is a long line of changes and I am trying to keep each one somewhat digestible (shooting for ~400 lines of diff). This one is slightly longer, so I'm hoping we can address any outstanding issues in a follow-up.

I'd love to get the optimizations in place so we can think about releasing scalding 0.18 with this change to typed pipe. This memory platform is just a proof of concept that you can run without cascading. We can polish it more and make it as nice as we like, but the main purpose is to have a realistic example proving that the API basically works, without getting into the weeds of spark or flink.

piyushnarang commented 7 years ago

Looks good to me. It seems the CI build has been hitting the 50 min timeout on the hadoop tests (I noticed that on https://github.com/twitter/scalding/pull/1700 as well). We'll need to either bump the timeout or break out the tests in that suite.