nownikhil closed this issue 3 years ago
Having SortedTake as a first-class AST element would allow us to use Top.of from Beam to implement it: https://beam.apache.org/releases/javadoc/2.32.0/org/apache/beam/sdk/transforms/Top.html.
If we could keep the current behavior for the other backends while the Beam backend uses Top, that would be the safest option from Beam's point of view.
But Nikhil and I aren't sure this change to the AST is easy or viable.
I think a way to solve it would be to pattern match on the Monoid:
case EmptyGuard(MapValueStream(SumAll(pq: PriorityQueueMonoid[V]))) =>
val ordering = pq.zero.comparator()
// now use Beam's built-in Top.of...
but an alternative approach would be to use an immutable heap, and that is really the better approach:
and we could implement that in Scalding and not call out to Algebird's mutable implementation. I really regret the mutable implementation in Algebird. We should have implemented an immutable heap there.
Should we add a dependency on Cats to Scalding, or copy-paste the implementation?
I would copy it to avoid the dependency, and copy the tests too. (I wrote the pairing heap, btw.) Just leave the copyright headers in place; the license is the same, so it would be fine.
But if you don't want to spend that time now, the pattern matching approach would also work without that.
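To make the immutable heap idea concrete, here is a minimal pairing-heap sketch in Scala. It is not the Cats Collections code being discussed, just an illustration of the shape such a structure could take; the names `Heap`, `Node`, `merge`, `removeMin`, `take`, and `HeapDemo` are made up for this example.

```scala
// Minimal immutable min-heap (pairing heap) sketch.
// Illustration only: not the Cats Collections implementation.
sealed trait Heap[+A]
case object Empty extends Heap[Nothing]
final case class Node[+A](min: A, children: List[Heap[A]]) extends Heap[A]

object Heap {
  def empty[A]: Heap[A] = Empty

  // Merge two heaps into a new one; neither input is modified.
  def merge[A](x: Heap[A], y: Heap[A])(implicit ord: Ordering[A]): Heap[A] =
    (x, y) match {
      case (Empty, h) => h
      case (h, Empty) => h
      case (Node(a, as), Node(b, bs)) =>
        if (ord.lteq(a, b)) Node(a, y :: as) else Node(b, x :: bs)
    }

  def insert[A: Ordering](h: Heap[A], a: A): Heap[A] =
    merge(h, Node(a, Nil))

  // Pair up the children of a removed root (not stack-safe; fine for a sketch).
  private def mergePairs[A: Ordering](hs: List[Heap[A]]): Heap[A] =
    hs match {
      case Nil              => Empty
      case h :: Nil         => h
      case h1 :: h2 :: rest => merge(merge(h1, h2), mergePairs(rest))
    }

  def removeMin[A: Ordering](h: Heap[A]): Option[(A, Heap[A])] =
    h match {
      case Empty       => None
      case Node(a, cs) => Some((a, mergePairs(cs)))
    }

  // The k smallest elements in ascending order.
  def take[A: Ordering](h: Heap[A], k: Int): List[A] = {
    @annotation.tailrec
    def loop(cur: Heap[A], n: Int, acc: List[A]): List[A] =
      if (n <= 0) acc.reverse
      else removeMin(cur) match {
        case None            => acc.reverse
        case Some((a, rest)) => loop(rest, n - 1, a :: acc)
      }
    loop(h, k, Nil)
  }
}

object HeapDemo {
  val left: Heap[Int]  = List(5, 1, 9).foldLeft(Heap.empty[Int])((h, a) => Heap.insert(h, a))
  val right: Heap[Int] = List(4, 7, 2).foldLeft(Heap.empty[Int])((h, a) => Heap.insert(h, a))
  val smallest: List[Int] = Heap.take(Heap.merge(left, right), 3)
}
```

Because `merge` builds a new heap and never touches its arguments, `left` and `right` in `HeapDemo` are still usable after being combined, which is the property Beam's input-mutation check cares about. A bounded sortedTake would also need to drop elements beyond `k` on each merge; that part is left out here.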
I'm implementing this in #1949. I think we can close this issue and follow up there.
Hi folks, we just implemented a Beam runner for Scalding. One open issue is with sortedTake. Since it uses the PriorityQueue Monoid, we mutate the input elements, and that causes the job to fail. This check is enabled in the Direct Runner to catch issues during testing, and it might not be enforced the same way by the Dataflow runner.
We might break consistency guarantees if we mutate input elements. https://stackoverflow.com/questions/43142900/apache-beam-returns-input-values-must-not-be-mutated-in-any-way-when-using-lo
What would be the best way to solve this?
Maybe extend the PQ Monoid and override its plus method so that it does not mutate the bigger PQ and instead returns a new one, and then use that; or make SortedTake a member of the AST and implement it for every runner.
Happy to hear a clever solution to this.
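For the first option, a rough sketch of what a non-mutating plus could look like, written directly against `java.util.PriorityQueue` so it stands alone. The helper names (`NonMutatingPlus`, `plus`, `sortedContents`) are hypothetical, and wiring this into a subclass of Algebird's PriorityQueueMonoid is assumed rather than shown.

```scala
import java.util.PriorityQueue

object NonMutatingPlus {
  // Hypothetical helper: combine two queues into a fresh one that keeps at most
  // `max` of the smallest elements. Both inputs are only read, never modified.
  def plus[K](max: Int, left: PriorityQueue[K], right: PriorityQueue[K])(
      implicit ord: Ordering[K]): PriorityQueue[K] = {
    // Reverse ordering so the head is the largest retained element.
    val out = new PriorityQueue[K](math.max(1, max), ord.reverse)
    def addAll(src: PriorityQueue[K]): Unit = {
      val it = src.iterator() // iterating does not change `src`
      while (it.hasNext) {
        out.add(it.next())
        if (out.size > max) out.poll() // drop the current largest
      }
    }
    addAll(left)
    addAll(right)
    out
  }

  // Read a queue's contents in ascending order without draining it.
  def sortedContents[K](q: PriorityQueue[K])(implicit ord: Ordering[K]): List[K] = {
    val b = List.newBuilder[K]
    val it = q.iterator()
    while (it.hasNext) b += it.next()
    b.result().sorted
  }
}

object NonMutatingPlusDemo {
  private def queueOf(xs: Int*): PriorityQueue[Int] = {
    val q = new PriorityQueue[Int](4, Ordering[Int].reverse)
    xs.foreach(x => q.add(x))
    q
  }
  val left: PriorityQueue[Int]  = queueOf(5, 1, 9)
  val right: PriorityQueue[Int] = queueOf(4, 7, 2)
  val sum: PriorityQueue[Int]   = NonMutatingPlus.plus(3, left, right)
}
```

The trade-off is that every plus copies elements into a new queue instead of reusing the bigger one, so it does more allocation than the in-place version in exchange for leaving the inputs intact.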