uwescience / raco

Compilation and rule-based optimization framework for relational algebra. Raco is the language, optimization, and query translation layer for the Myria project.
Other
72 stars 19 forks source link

Tests for Broadcast #511

Open bmyerz opened 8 years ago

bmyerz commented 8 years ago

Now that we are adding broadcast support in the catalog, I suspect that many tests would "break" if they were performed on broadcasted input relations.

The reason is that broadcast is intended to be a physical property, not a logical property. Raco doesn't yet reason in a robust way about physically (and not logically) broadcasted relations.

senderista commented 8 years ago

Something @mbalazin suggested this morning: we should test a single select on a broadcast relation to verify it is only run on one worker (presumably chosen at random).

bmyerz commented 8 years ago

@mbalazin @senderista I think we can implement that as Select(predicate())[Select(WORKER_ID()==0)[Broadcast[...

For logical operators that turn into communicating physical operators (groupby, join, ...), we can choose to either compute locally and retain the broadcasted property on the output or precede the operator with a Select(WORKER_ID==0).

f the final Store does not indicate broadcasted then we have a Select(WORKER_ID==0) at the top that may be pushed down if desired.

bmyerz commented 8 years ago

It does occur to me that we have no tool in Raco for evaluating queries with respect to physical properties, only logical. (our FakeDB has a lot of no-ops because it is logical: https://github.com/uwescience/raco/blob/a094b9cb558932cb44abb4e666f7836059fc7e8b/raco/fakedb.py#L493). For testing Shuffle what I did was inspect whether the plan has features we expect. I think Broadcast actually has more correctness pitfalls, so I'm less satisfied with this level of testing. It is conceivable to extend our FakeDB evaluator to emulate a parallel query engine for testing purposes....

bmyerz commented 8 years ago

changes go here https://github.com/uwescience/raco/tree/bmyerz/broadcast

senderista commented 8 years ago

Can Raco treat WORKER_ID() specially and only send a LocalFragment containing an equality predicate on that attribute with a constant(s) to workers with those ID(s)? That would optimize queries over broadcast relations where we don't want duplicate results (automatically insert WORKER_ID() == 0 predicate and push it all the way down to workers).

bmyerz commented 8 years ago