twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

Infinite loop in one of the rules #1747

Closed johnynek closed 6 years ago

johnynek commented 6 years ago

We get a stack overflow from Dagon which I think is due to an infinite loop, in one of the property checks (see below).

I thought I had fixed this, but it may be more complex that I thought. The issue is likely the AddExplicitForks rule which actually add, rather than remove, nodes to the graph. It is possible that by adding you add a node that points to itself since the node you add may already exist. I tried to deal with this.

I think there are three fixes:

  1. see if we can fix the rule to not do this.
  2. see if we can improve dagon to deal with this case, since I think it is assuming the node added cannot reach a node equivalent to the one being replaced.
  3. remove this rule if we can't use it safely. Also note, we didn't have this rule until now, and required users to add explicit forks with .fork. Since in map-reduce, usually it is serialization and communication that is expensive, re-running functions often isn't the expensive part, so it is not clear if we actually gain much with this rule in any case (which has a goal of minimizing function application rather than map-reduce steps).
java.lang.StackOverflowError

    at scala.collection.immutable.HashMap.get(HashMap.scala:53)

    at com.stripe.dagon.HMap.get(HMap.scala:56)

    at com.stripe.dagon.HMap.apply(HMap.scala:53)

    at com.stripe.dagon.Expr$$anon$1.$anonfun$toFunction$1(Expr.scala:78)

    at com.stripe.dagon.FunctionK.apply(FunctionK.scala:10)

    at com.stripe.dagon.FunctionK.apply$(FunctionK.scala:9)

    at com.stripe.dagon.Expr$$anon$1.apply(Expr.scala:75)

    at com.stripe.dagon.Memoize$$anon$1.$anonfun$toFunction$2(Memoize.scala:39)

    at com.stripe.dagon.HCache.getOrElseUpdate(HCache.scala:38)

    at com.stripe.dagon.Memoize$$anon$1.$anonfun$toFunction$1(Memoize.scala:39)

    at com.stripe.dagon.FunctionK.apply(FunctionK.scala:10)

    at com.stripe.dagon.FunctionK.apply$(FunctionK.scala:9)

    at com.stripe.dagon.Memoize$$anon$1.apply(Memoize.scala:37)

    at com.stripe.dagon.Expr$$anon$1.$anonfun$toFunction$1(Expr.scala:80)

    at com.stripe.dagon.FunctionK.apply(FunctionK.scala:10)

    at com.stripe.dagon.FunctionK.apply$(FunctionK.scala:9)

    at com.stripe.dagon.Expr$$anon$1.apply(Expr.scala:75)

    at com.stripe.dagon.Memoize$$anon$1.$anonfun$toFunction$2(Memoize.scala:39)

    at com.stripe.dagon.HCache.getOrElseUpdate(HCache.scala:38)

    at com.stripe.dagon.Memoize$$anon$1.$anonfun$toFunction$1(Memoize.scala:39)

    at com.stripe.dagon.FunctionK.apply(FunctionK.scala:10)

    at com.stripe.dagon.FunctionK.apply$(FunctionK.scala:9)

    at com.stripe.dagon.Memoize$$anon$1.apply(Memoize.scala:37)

    at com.stripe.dagon.Expr$$anon$1.$anonfun$toFunction$1(Expr.scala:78)

    at com.stripe.dagon.FunctionK.apply(FunctionK.scala:10)

    at com.stripe.dagon.FunctionK.apply$(FunctionK.scala:9)

    at com.stripe.dagon.Expr$$anon$1.apply(Expr.scala:75)

    at com.stripe.dagon.Memoize$$anon$1.$anonfun$toFunction$2(Memoize.scala:39)

    at com.stripe.dagon.HCache.getOrElseUpdate(HCache.scala:38)

    at com.stripe.dagon.Memoize$$anon$1.$anonfun$toFunction$1(Memoize.scala:39)

    at com.stripe.dagon.FunctionK.apply(FunctionK.scala:10)

    at com.stripe.dagon.FunctionK.apply$(FunctionK.scala:9)

    at com.stripe.dagon.Memoize$$anon$1.apply(Memoize.scala:37)

    at com.stripe.dagon.Expr$$anon$1.$anonfun$toFunction$1(Expr.scala:80)
oscar-stripe commented 6 years ago

this is a bug with how the rule is written. See: https://github.com/stripe/dagon/pull/18 for an example of how to write it correctly.