Open domoritz opened 10 years ago
(sort-merge join is not a language feature, but rather a rule in our eventual optimizer.)
But somebody has to write the optimizer rule for that. This issue is for language support and optimizer support.
It isn't clear whether this refers to ORDER BY (a language feature) or sort-merge join (an optimizer feature). These should be two separate issues.
ORDER BY because we don't have a merge consumer in myria yet.
Distributed merge can lead to deadlock for multiple producers/multiple consumers: see discussion in http://dl.acm.org/citation.cfm?id=152611. There are various workarounds discussed in that reference.
We already have a global Merge consumer that can run on the coordinator, along with local MergeJoin and InMemoryOrderBy operators. We would need to push sorts down to Postgres for data that didn't fit in memory, but we may already have what we need to support ORDER BY
explicitly in MyriaL. The Raco and Myria algebras already support ORDER BY
(there is a rewrite rule to convert the Raco OrderBy
logical operator to the Myria MyriaInMemoryOrderBy
physical operator), but there is no syntax support in the MyriaL parser. We should prioritize implementing ORDER BY
in MyriaL based on user needs and limitations of the supporting operators.
We also need to implement a rewrite rule to decompose a global OrderBy
logical operator into a global Merge
of local OrderBy
operators, somewhat akin to the DecomposeGroupBy
rule (do we also need a new Merge
logical operator?). We should also have a rewrite rule for ORDER BY
+ LIMIT
, so the global Merge
only fetches the top k
results from each local OrderBy
(the existing CollectBeforeLimit
rule doesn't seem to preserve order in the final output).
I totally don't remember writing https://github.com/uwescience/myria/commit/555907f3ac73e0fed6a08d29b3c36cd9677b7114. Anyway, note that this is not a merge consumer! So it won't merge the inputs from children but only the inputs from different operators.
Right, so you're saying we need a proper exchange operator for Merge? I'll open an issue for that, thanks!
@bmyerz would appreciate feedback on adding a Merge
operator to the logical algebra (possibly unnecessary since it only exists to physically implement global logical OrderBy
).
I'd prefer not to see a merge operator at the logical level.
Is the use case here top k? That's a reasonable logical operator.
In fact, I'd go further and claim logical order by is always unnecessary except at the root, just like in SQL, except for plans involving stateful apply.
(SQL gets around that by using a separate order by clause specifically for window functions.)
On Monday, May 9, 2016, Tobin Baker notifications@github.com wrote:
@bmyerz https://github.com/bmyerz would appreciate feedback on adding a Merge operator to the logical algebra (possibly unnecessary since it only exists to physically implement global logical OrderBy).
— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/uwescience/raco/issues/174#issuecomment-218067109
@senderista I agree with Bill on this
@billhowe @bmyerz OK, sounds like we just need a new Myria Merge
physical operator (corresponding to the unimplemented Merge
exchange operator) targeted by a rewrite rule which decomposes logical OrderBy
into Merge
+ physical (local) OrderBy
?
Also, can we go ahead and push logical OrderBy
into SQL even in the absence of ORDER BY
support in MyriaL? That would allow us to use Postgres indexes to optimize local sorting (and may be the only feasible alternative when local data is too large for the InMemoryOrderBy
operator). Since this will pretty much be a requirement for efficiently supporting ORDER BY
in MyriaL, why not get it implemented and tested first?
@senderista Ya, we can do the local OrderBy (and insert a Merge so that the plan is consistent with logical OrderBy).
order by
translation to Merge(Order by(...
Something @mbalazin pointed out is that for top-k queries with small k, we can just run local top-k queries at each worker (possibly pushing them into Postgres), and sort the combined results in memory at the coordinator using the existing InMemoryOrderBy
operator. That should be a pretty simple rewrite rule in Raco, and would eliminate the dependency on a global Merge
operator for the top-k scenario.
MyriaL should support sorting. We have support for sorted scans and an in memory order by. Later, we probably want a sort merge join.