twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

Cascading 3 upgrade strategy #1465

Open ianoc opened 8 years ago

ianoc commented 8 years ago

So I believe the Cascading 3 change won't be backwards compatible with 2.6, though we obviously don't want to have too long-running a PR. Thoughts?

1) I think we should probably have a cascading3 branch for now, and do several PRs into it. It's not ideal, since we will need to be careful about incompatible changes to develop, but hopefully it wouldn't live long. A more code-involved but maybe easier stepping-stone approach (I'm not sure how well it would work) would be to introduce an API/set of files acting as a bridge layer, with one implementation each for Cascading 2 and Cascading 3; outside of commons, most of the changes look pretty isolated. I'm in favor of this if it works, since I think we could get onto develop sooner. (A sketch of the bridge idea is at the end of this comment.)

2) Should we cut this down by eliminating some dependencies? The most obvious candidate here is dfs-datastores.

3) What should our merge-to-develop criteria be?
   (1) All unit/integration tests pass.
   (2) Some usage at Twitter/Stripe/elsewhere?
   (3) Should a working Tez fabric be a requirement, or is Cascading 3 on Hadoop good enough to merge to develop?
   (4) Versioning/ANN for scalding: just the next incompatible release sounds probably OK as per our current numbering. When is it reasonable for us to aim for this to be in?

//cc @johnynek @cchepelov @rubanm
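
For illustration, a minimal sketch of what the bridge-layer option could look like. All names here (`CascadingBridge`, `planFlow`, the module names) are hypothetical, not existing scalding code:

```scala
package com.twitter.scalding.bridge

// Hypothetical bridge API: scalding core compiles against this trait only,
// never against Cascading classes directly.
trait CascadingBridge {
  // Opaque handle for the version-specific flow type.
  type Flow

  // Plan a flow for the given job; implemented once per Cascading major version.
  def planFlow(jobName: String): Flow

  // Run the flow to completion, returning whether it succeeded.
  def run(flow: Flow): Boolean
}

// A scalding-cascading3 module would then supply something like:
//   object Cascading3Bridge extends CascadingBridge { ... }
// and a scalding-cascading2 module the mirror image, so only the bridge
// modules depend on a concrete Cascading version.
```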

sriramkrishnan commented 8 years ago

I am OK with Tez being a non-requirement if that makes it easier to get to develop. Tez would be great to have, of course, but it could wait until after that?

johnynek commented 8 years ago

+1 to a short lived branch.

+1 to cascading 3 working on Hadoop for merge to develop/master. In my view, Tez is still somewhat experimental, so I wouldn't gate on that. And moving forward to the latest cascading is still progress.

I would do the merge to develop when Twitter's CI is green and integration tests pass on a snapshot published from the cascading3 branch. I'll try to coordinate something similar at Stripe, but I'm just getting ramped up here.

/cc @avibryant

cchepelov commented 8 years ago

(agreeing with 99% of what's being said here, especially not gating on Tez or Flink, since willing folks can now switch the fabric at runtime anyway)

re. backwards compatibility: indeed, it really can't be 100% backwards compatible. It seems most user code won't be affected at the source level (few Cascading concepts spilled above scalding, and fewer of those underwent interface changes), but there's always someone somewhere doing something. Safer to deny compatibility ☺ and binaries will break anyway.

cchepelov commented 8 years ago

Question with respect to testing the various fabrics:

Duplicating the tests that could be common would be a waste; some tests might be common to some fabrics but not all. Some will definitely remain fabric-specific. And what can be said of tests can also be said of initialization code (c.t.s.platform.LocalCluster & friends).

What do you think would be the best way to deal with this, taking into account that a given Classpath may only include exactly one fabric in addition to cascading-local?

/cc @simondumas

ianoc commented 8 years ago

We could separate the tests from their execution for the ones we want to share.

So we'd have a scalding-hadoop-test module whose src package contains test definitions that can be called from any test package; scalding-hadoop-test:test could be the legacy MR1 run. Then we take the maybe simpler but noisy option of one package per Cascading fabric we wish to support, each of which can refer to anything in the source package of scalding-hadoop-test (sketch below). Otherwise you could do magic in sbt to add more environments to the test command, I believe, but I don't particularly like that, since I think it would be hard to keep ./sbt test running all the tests at that point.
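
A rough sketch of that layout, assuming hypothetical module and class names (`SharedPlatformTests`, `withCluster`). The point is that the shared suite lives under src/main so every fabric's test module can extend it:

```scala
package com.twitter.scalding.platform

import org.scalatest.WordSpec

// Lives in scalding-hadoop-test/src/main (not src/test), so it lands on
// the classpath of every per-fabric test module.
abstract class SharedPlatformTests extends WordSpec {
  // Each fabric module supplies its own cluster bootstrap.
  def withCluster[T](body: => T): T

  "a platform job" should {
    "produce the same output on this fabric" in {
      withCluster {
        // ... submit the job against the fabric and assert on its output
      }
    }
  }
}

// Then, e.g., a scalding-hadoop2-mr1-test module's src/test contains only:
//   class Mr1PlatformTests extends SharedPlatformTests {
//     def withCluster[T](body: => T): T = { /* start MR1 local cluster */ body }
//   }
```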

cchepelov commented 8 years ago

@ianoc This sounds good, with scalding-hadoop-test:test (and not just scalding-hadoop-test) depending on cascading-hadoop.
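
In sbt terms that might look like the following sketch (artifact names and versions are illustrative, not scalding's actual build): the shared module's compile scope stays fabric-agnostic, and a concrete fabric appears only in the test configuration.

```scala
// build.sbt sketch: downstream fabric modules that depend on the src package
// don't inherit a fabric, but scalding-hadoop-test:test runs with one.
lazy val scaldingHadoopTest = (project in file("scalding-hadoop-test"))
  .settings(
    libraryDependencies ++= Seq(
      // compile scope: fabric-agnostic Cascading core only
      "cascading" % "cascading-core"   % "3.0.2",
      // test scope only: the legacy MR1 fabric for this module's own tests
      "cascading" % "cascading-hadoop" % "3.0.2" % "test"
    )
  )
```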

One side question: might it be useful at some point to also run the "long form" tests (currently run only in -local) on all supported fabrics? This would be useful to check result accuracy, and also (and primarily) to ensure that Cascading's core and per-fabric planners get a shakeout that matters in a scalding context. I guess that's a later step anyway.

ianoc commented 8 years ago

Definitely -- some of the core tests, btw, do run in hadoop mode with Cascading. The platform tests run in a hadoop local cluster, testing Hadoop/Cascading interaction (mostly serialization). The normal tests can do run and runHadoop, where runHadoop runs in hadoop local mode but should use Cascading's hadoop planner.
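
For reference, the run/runHadoop pattern on scalding's existing JobTest harness looks roughly like this (the job and data are made up for illustration):

```scala
import com.twitter.scalding._

// A toy job, just to have something to test.
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(_.split("\\s+"))
    .map((_, 1L))
    .sumByKey
    .write(TypedTsv[(String, Long)](args("output")))
}

class WordCountJobTest extends org.scalatest.WordSpec {
  "WordCountJob" should {
    "count words in both local and hadoop-local mode" in {
      JobTest(new WordCountJob(_))
        .arg("input", "in.txt")
        .arg("output", "out.tsv")
        .source(TextLine("in.txt"), List((0, "hello hello world")))
        .typedSink(TypedTsv[(String, Long)]("out.tsv")) { out =>
          assert(out.toMap == Map("hello" -> 2L, "world" -> 1L))
        }
        .run        // in-memory (cascading-local) mode
        .runHadoop  // hadoop local mode, exercising the hadoop planner
        .finish
    }
  }
}
```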

rubanm commented 8 years ago

Based on the discussion here -- I'll be sending a series of PRs to move to cascading3 in piecemeal fashion (for ease of reviewing).