twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

Add a guard on mutability in `Fill` function #1894

Closed ttim closed 5 years ago

ttim commented 5 years ago

During internal testing in Twitter we found that dedupMerge optimization sometimes breaks jobs. Root cause of that turned out to be using of mutable values (to be precise PriorityQueue through topK monoid) in your job code and Fill function which duplicates this mutable values.

In this PR I've added guard against mutability in Fill function. It's not guaranteed to catch all mutability issues but at least some of them for almost no cost.

johnynek commented 5 years ago

this hurts everyone's performance just because some jobs are using mutable values.

Can they instead disable this optimization on their end?

https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/Config.scala#L272

johnynek commented 5 years ago

computing hashCode has to traverse an entire object generally.

ttim commented 5 years ago

Can they instead disable this optimization on their end?

Unfortunately that's impossible since this particular optimization is enabled unconditionally in CascadingBackend. When we actually tried to disable it it broke our scalding job completely which feels related to this code but that's I'm unsure.

While this change has some perf overhead it's important to understand this is very rare - it happens only if you have merged identical typed pipes.

ttim commented 5 years ago

@johnynek What do you think? Do you have this kind of graphs (a ++ a) in production at Stripe?

I have a feeling this optimization is more about making cascading to work with this kind of graphs (originally you added this optimization to fix https://github.com/twitter/scalding/issues/1786) and not to actually improve performance of the job. Since it's really hard to imagine this kind of graphs in real life (in fact we found this issue because of typo in a test) I don't think this performance overhead matters.

johnynek commented 5 years ago

one principle I have long had for TypedPipe/Execution/etc... is that if it compiles, it should run.

I think there are probably legitimate cases of a ++ a that come up when people are writing functions on TypedPipe.

If we start trying to detect mutation of values, where do we stop? This isn't the only place that could go wrong.

We could keep adding more and other checks.

I'd really prefer we say: use mutable values at your own risk, and it is your responsibility to write tests to show that things work. They are not explicitly supported since it would hamstring our ability to do many optimizations.

ttim commented 5 years ago

I think rule of thumb of mutations should be:

if optimization might break in case of mutations then it might have sense to check for mutations

I don't think this is very common.

Unfortunately people write stupid code (and don't test it excessively) and I think if we have an ability to fail them fast with clear message it's a good idea to do that. Especially if cost of that is small and it happens rarely.

johnynek commented 5 years ago

For instance, even map fusion is not safe if you do mutating side effects:

scala> (0 to 3).map { x => println(s"hey: $x"); x }.map { x => println(s"bye: $x"); x }
hey: 0
hey: 1
hey: 2
hey: 3
bye: 0
bye: 1
bye: 2
bye: 3

scala> (0 to 3).map { x => println(s"hey: $x"); println(s"bye: $x"); x }
hey: 0
bye: 0
hey: 1
bye: 1
hey: 2
bye: 2
hey: 3
bye: 3

that's a fundamental optimization we do, similarly with moving filter functions around.

ttim commented 5 years ago

That's a good point. I guess I'm agree to abandon this PR then.

We still need to fix topK aggregation thou.