Open bmyerz opened 8 years ago
Something @mbalazin suggested this morning: we should test a single select on a broadcast relation to verify it is only run on one worker (presumably chosen at random).
@mbalazin @senderista I think we can implement that as
Select(predicate())[Select(WORKER_ID()==0)[Broadcast[...
For logical operators that turn into communicating physical operators (groupby, join, ...), we can choose to either compute locally and retain the broadcasted property on the output or precede the operator with a Select(WORKER_ID==0)
.
f the final Store does not indicate broadcasted
then we have a Select(WORKER_ID==0)
at the top that may be pushed down if desired.
It does occur to me that we have no tool in Raco for evaluating queries with respect to physical properties, only logical. (our FakeDB has a lot of no-ops because it is logical: https://github.com/uwescience/raco/blob/a094b9cb558932cb44abb4e666f7836059fc7e8b/raco/fakedb.py#L493). For testing Shuffle what I did was inspect whether the plan has features we expect. I think Broadcast actually has more correctness pitfalls, so I'm less satisfied with this level of testing. It is conceivable to extend our FakeDB evaluator to emulate a parallel query engine for testing purposes....
changes go here https://github.com/uwescience/raco/tree/bmyerz/broadcast
Can Raco treat WORKER_ID()
specially and only send a LocalFragment
containing an equality predicate on that attribute with a constant(s) to workers with those ID(s)? That would optimize queries over broadcast relations where we don't want duplicate results (automatically insert WORKER_ID() == 0
predicate and push it all the way down to workers).
Select(WORKERID)
all the way to Scan we can produce a json plan for Myria that only names a single worker from the catalog
Now that we are adding broadcast support in the catalog, I suspect that many tests would "break" if they were performed on broadcasted input relations.
The reason is that broadcast is intended to be a physical property, not a logical property. Raco doesn't yet reason in a robust way about physically (and not logically) broadcasted relations.