Closed ttim closed 5 years ago
This hurts everyone's performance just because some jobs use mutable values.
Can they instead disable this optimization on their end?
Computing `hashCode` generally has to traverse an entire object.
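To illustrate the cost concern, here is a small plain-Scala sketch (not scalding code) showing why a `hashCode`-based check is not free for collection-valued records:

```scala
// A collection's hashCode is computed from every element, so a per-record
// hashCode check is O(size of the record), not O(1).
val a = List.fill(1000)(0)
val b = a.updated(999, 1) // differs from `a` only in the last element

// The hash reflects even the final element, so computing it must have
// traversed the whole list.
assert(a.hashCode != b.hashCode)
```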
> Can they instead disable this optimization on their end?
Unfortunately that's impossible, since this particular optimization is enabled unconditionally in `CascadingBackend`. When we actually tried to disable it, it broke our scalding job completely, which feels related to this code, but I'm unsure.
While this change has some perf overhead, it's important to understand that this is very rare: it happens only if you have merged identical typed pipes.
@johnynek What do you think? Do you have this kind of graph (`a ++ a`) in production at Stripe?
I have a feeling this optimization is more about making Cascading work with this kind of graph (originally you added it to fix https://github.com/twitter/scalding/issues/1786) and not about actually improving job performance. Since it's really hard to imagine this kind of graph in real life (in fact, we found this issue because of a typo in a test), I don't think this performance overhead matters.
One principle I have long had for TypedPipe/Execution/etc. is that if it compiles, it should run.
I think there are probably legitimate cases of `a ++ a` that come up when people are writing functions on `TypedPipe`.
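For example, a self-merge can appear without anyone literally writing `a ++ a`, once pipes flow through generic helper functions. A minimal sketch, using plain `List`s as a stand-in for `TypedPipe` (the helper name `withFallback` is invented for illustration):

```scala
// A generic helper that merges two pipes; plain Lists stand in for TypedPipe.
def withFallback[T](primary: List[T], fallback: List[T]): List[T] =
  primary ++ fallback

val events = List("click", "view")

// A caller that happens to pass the same pipe twice produces `a ++ a`:
val merged = withFallback(events, events)
assert(merged == List("click", "view", "click", "view"))
```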
If we start trying to detect mutation of values, where do we stop? This isn't the only place that could go wrong.
We could keep adding more and other checks.
I'd really prefer we say: use mutable values at your own risk, and it is your responsibility to write tests to show that things work. They are not explicitly supported since it would hamstring our ability to do many optimizations.
I think the rule of thumb for mutations should be:
if an optimization might break in the presence of mutations, then it might make sense to check for them.
I don't think this is very common.
Unfortunately, people write stupid code (and don't test it thoroughly), and I think if we have the ability to fail them fast with a clear message, it's a good idea to do that, especially if the cost is small and it happens rarely.
For instance, even map fusion is not safe if you do mutating side effects:
```scala
scala> (0 to 3).map { x => println(s"hey: $x"); x }.map { x => println(s"bye: $x"); x }
hey: 0
hey: 1
hey: 2
hey: 3
bye: 0
bye: 1
bye: 2
bye: 3

scala> (0 to 3).map { x => println(s"hey: $x"); println(s"bye: $x"); x }
hey: 0
bye: 0
hey: 1
bye: 1
hey: 2
bye: 2
hey: 3
bye: 3
```
That's a fundamental optimization we do; similarly with moving filter functions around.
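The filter case can be sketched the same way: pushing a filter past a side-effecting map preserves the result but changes which side effects run (plain collections here, not `TypedPipe`):

```scala
val log = scala.collection.mutable.ListBuffer.empty[Int]

// Original shape: map first, then filter; the map's side effect runs for
// every element.
val before = (0 to 3).map { x => log += x; x * 2 }.filter(_ > 2)
val seenBefore = log.toList // logged 0, 1, 2, 3
log.clear()

// After pushing the filter below the map, the side effect runs for fewer
// elements.
val after = (0 to 3).filter(x => x * 2 > 2).map { x => log += x; x * 2 }
val seenAfter = log.toList // logged only 2, 3

assert(before == after)         // same data...
assert(seenBefore != seenAfter) // ...different observable side effects
```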
That's a good point. I guess I agree we should abandon this PR then.
We still need to fix the `topK` aggregation though.
During internal testing at Twitter we found that the `dedupMerge` optimization sometimes breaks jobs. The root cause turned out to be the use of mutable values in job code (specifically `PriorityQueue`, via the `topK` monoid) combined with the `Fill` function, which duplicates these mutable values. In this PR I've added a guard against mutability to the `Fill` function. It's not guaranteed to catch all mutability issues, but it catches at least some of them at almost no cost.
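The failure mode and the guard can be sketched in plain Scala. This is a hypothetical reconstruction, not the actual scalding code: `fillWithGuard` is an invented name, and an `ArrayBuffer` stands in for the mutable top-K accumulator. The key point is that a `Fill`-style duplication hands out the *same* reference n times, so a mutation through one "copy" is visible through all of them, and a before/after `hashCode` comparison can catch it:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch of a Fill-like step that hands out n "copies" of a
// record and guards against mutation by comparing hashCodes.
def fillWithGuard[T](value: T, copies: Int)(consume: T => Unit): Unit = {
  val before = value.hashCode // traverses the record, hence the perf concern
  (1 to copies).foreach(_ => consume(value)) // every copy is the same reference
  require(value.hashCode == before, "record was mutated while being duplicated")
}

val buf = ArrayBuffer(1, 2, 3) // stand-in for a mutable top-K accumulator
val caught =
  try { fillWithGuard(buf, 2)(copy => copy += 7); false }
  catch { case _: IllegalArgumentException => true }

assert(caught) // mutating one "copy" mutated them all, and the guard fired
```

Note the limitation this illustrates: the check only works for values with content-based `hashCode`; a class that inherits identity hashing (as `java.util.PriorityQueue` does) would slip past it, which is consistent with the guard catching "at least some" issues rather than all of them.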