**Open** · BrandonHaynes opened this issue 9 years ago
A workaround for this issue is to force a shuffle and store the result:

```
data = scan(data);
store(data, data, [x]);
```
@jingjingwang can you take a look at this? Since we're now using LOAD() in all of our example code, this will be hugely confusing for users. Hopefully it's an easy fix...
I took a look; let me explain what happened. There are two scan operator encodings in MyriaX: `TableScanEncoding` and `DbQueryScanEncoding`. The first only scans a single table; the second carries a SQL statement that can run an arbitrary query over multiple tables. The query plan that raco generates in this case uses `DbQueryScanEncoding` to push some optimizations down (I think).
For `TableScanEncoding`, MyriaX assigns the operator only to the workers that store the relation. For `DbQueryScanEncoding` it's hard to do the same, because: 1. we would need a SQL parser in MyriaX (to get the list of touched relations), and MyriaX is not supposed to do query parsing; and 2. even if we had the list of accessed relations, should we assign the operator only to workers that have all of them?
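To make the assignment question in point 2 concrete, here is a minimal sketch (all names are illustrative, not MyriaX code) of the conservative "workers that have all the relations" policy, using the catalog's relation-to-workers mapping:

```python
# Hypothetical sketch: assign a multi-relation scan only to workers that
# store *every* touched relation (set intersection). With a LOADed singleton
# relation, this collapses to the single worker that won the assignment.

def workers_for_scan(catalog, touched_relations):
    """Return the set of workers that store every touched relation."""
    worker_sets = [set(catalog[rel]) for rel in touched_relations]
    result = worker_sets[0]
    for ws in worker_sets[1:]:
        result &= ws
    return result

# R1 was LOADed onto a single worker; R2 is partitioned across three.
catalog = {"R1": [1], "R2": [1, 2, 3]}
print(workers_for_scan(catalog, ["R1"]))         # {1}
print(workers_for_scan(catalog, ["R1", "R2"]))   # {1}
```

Note the downside this illustrates: a query joining a singleton relation against a partitioned one would run on only one worker.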
PS: if we remove the line `R2 = [from R1 emit count(*)];`, it works, because raco then generates `TableScanEncoding` instead of `DbQueryScanEncoding`. It also works if we check the option "Disable pushing computation into database".
So one workaround is "Disable pushing computation into database". If we want to go further, maybe we can add a field to `DbQueryScanEncoding` listing the relations its SQL touches; it would be great if raco could fill it in. Otherwise we would need a SQL parser in MyriaX.
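As a rough illustration of that proposal, here is a hedged sketch of what the encoding could look like with a *hypothetical* `sourceRelations` field (the field name and surrounding structure are assumptions for illustration, not the actual MyriaX JSON schema), which raco would populate so MyriaX never has to parse the SQL:

```python
# Hypothetical DbQueryScanEncoding fragment. "sourceRelations" is NOT an
# existing MyriaX field; it is the proposed addition that would let the
# coordinator pick workers without a SQL parser.
db_query_scan = {
    "opType": "DbQueryScan",                  # approximate, for illustration
    "sql": "SELECT COUNT(*) FROM R1",
    "sourceRelations": [                      # hypothetical new field
        {"relationName": "R1"},
    ],
}

# The coordinator could then read the touched relations directly:
touched = [r["relationName"] for r in db_query_scan["sourceRelations"]]
print(touched)   # ['R1']
```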
Just realized this: for the parser approach, we would need to ask each DBMS to "explain" the query, then parse the result to find all the scans. A single shared parser won't work, since each backend produces different plan output.
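A minimal sketch of that per-backend parsing, assuming PostgreSQL-style text `EXPLAIN` output (the "Seq Scan on ..." / "Index Scan ... on ..." line shapes are PostgreSQL-specific, which is exactly why one shared parser isn't enough):

```python
import re

# Match the relation name in PostgreSQL plan lines such as
# "Seq Scan on r1" or "Index Scan using r2_pkey on r2".
SCAN_LINE = re.compile(
    r"\b(?:Seq|Index|Index Only|Bitmap Heap) Scan (?:using \S+ )?on (\w+)"
)

def scanned_relations(explain_output):
    """Return the set of relation names the plan scans."""
    return set(SCAN_LINE.findall(explain_output))

plan = """\
Aggregate  (cost=1.04..1.05 rows=1 width=8)
  ->  Nested Loop  (cost=0.00..1.03 rows=1 width=0)
        ->  Seq Scan on r1  (cost=0.00..1.01 rows=1 width=4)
        ->  Index Scan using r2_pkey on r2  (cost=0.00..0.01 rows=1 width=4)
"""
print(sorted(scanned_relations(plan)))   # ['r1', 'r2']
```

Every other supported backend would need its own equivalent of `SCAN_LINE`, plus handling for plan formats that change across DBMS versions.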
Is creating an empty relation on all non-load workers not a potential workaround?
I think that would also work as a workaround (although it doesn't sound very correct, since those workers shouldn't receive the query).
Yes, I'm concerned about that approach corrupting the semantics of our metadata. We would no longer be able to assume that a relation partitioned on a given worker actually had data on that worker without checking the size of the partition.
It sounds like fixing this before the demo isn't realistic and we should pursue one of these workarounds:

1. Force `store()` statements (and continually remind users to do this)
2. Change `myria-web` to make "Disable pushing computation into database" the default

Unless 2) significantly slows a lot of queries, I lean toward 2). Even if it does slow down some queries, existing queries will still be correct when we change the `myria-web` default, and we should be able to do that pretty soon. What do the rest of you think?
Yep, I also vote for 2) for the demo. I don't think we'll be demonstrating big queries anyway, although we should check that they still pass with 2).
As a precondition, load a dataset with the `load` function:

Now execute an aggregation over the resulting relation:

This query fails with the exception below. Note that the loaded relation exists only in the catalog of one worker (the one that won the singleton assignment), and it appears that other workers are incorrectly trying to access it.