twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

Merge develop into cascading3 #1762

Closed piyushnarang closed 6 years ago

piyushnarang commented 6 years ago

Putting this up to have CI kick off tests so that we can move on to fixing tests and cleaning up stuff I might have messed up as the merge was pretty large.

CLAassistant commented 6 years ago

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

johnynek commented 6 years ago

looks like the hadoop tests has one test failure:

 Buffer(HadoopFlowStep[name: (1/3)], HadoopFlowStep[name: (2/3) output2], HadoopFlowStep[name: (3/3) output]) had size 3 instead of expected size 4 (PlatformTest.scala:654)

Thanks a ton for working on this. Would love to get this merged and move to cascading 3.

piyushnarang commented 6 years ago

Yeah there's an issue in the hadoop test and some failures in the estimator-tests. I think I might have missed something in the merge as there were a bunch of reducer & memory estimator changes that went into develop. I'm working on this on the side so if you're urgently blocked on this, I wouldn't want to hold you up so I can grant push access to my branch (or you can fork this one).

piyushnarang commented 6 years ago

Actually @johnynek looking at the job that fails:

class MergeTwoSinksForceToDiskTypedJob(args: Args) extends Job(args) {
  import TinyJoinAndMergeJob._

  val input1 = TypedPipe.from(joinInput1)
  val input2 = TypedPipe.from(joinInput2)

  val merged = (input1 ++ input2).asKeys.group.size.map { case (k, v) => (k, v.toInt) }

  merged
    .forceToDisk
    .write(output)

  merged
    .write(output2)
}

The check seems to be failing on the fact that we are expecting 4 steps (we get 3 though). Tracking back the original bug - https://github.com/cwensel/cascading/pull/52 it was more regarding the fact that the Cascading3 planner failed on this code path. I'm not sure if the 3 steps is a bad thing?

johnynek commented 6 years ago

@piyushnarang I agree. I think you can change the test to be <= 4 or == 3.

I think 3 is better right? 1 Map-only job for the second part, and 1 MR job for the forced part, and 1 Map only for the force -> write. Seems like 3 is what I would have expected.

piyushnarang commented 6 years ago

Yeah that's what I was thinking too. Ok, I switched this out. We have some failures in the estimators left and then we should be good to go (famous last words) :-)

johnynek commented 6 years ago

very exciting.

We are somehow hitting constant problems in planning now as Stripe. Many jobs take hours to plan and some have been stuck in planning for over 36 hours until we killed them.

I am hoping that cascading 3 can fix this issue for us.

piyushnarang commented 6 years ago

hey @johnynek - so our tests are green now. There's a lot of changes here as we've diverged a fair bit. I've taken a pass to sanity check but I might have missed something :-). Looking at the diff against develop (https://github.com/twitter/scalding/compare/develop...piyushnarang:cascading3) it largely looks like only cascading3 changes there.

johnynek commented 6 years ago

minor issues:

https://github.com/twitter/scalding/compare/develop...piyushnarang:cascading3#diff-0336b62cdc4bda6c450ac5c6a3e3fc52R273

I don't see why the recursion on the head uses the descriptions but not the tail. Why the change with respect to the current code?

johnynek commented 6 years ago

I think we should go ahead and merge this after @piyushnarang comments on my above question.

I reviewed the diff. Looks expected. The only issue, I guess is some template taps were removed. I hope that is not a major downstream pain.

Note: this merge is into the twitter:cascading3 branch, not develop.

@pankajroark can someone at Twitter kick-off a test run of a version of scalding against that branch? If it looks good, we should merge. We are really struggling with planner runtime issues at Stripe currently and I am hopeful the new planner will help. We are creating very big jobs (since our data is smaller than yours, people can write much more complex jobs that still complete in reasonable time), so we have hundreds of steps.

johnynek commented 6 years ago

i'm going to merge this into cascading3 branch and try to fix up with a followup PR changing the build to point to https for more repos.