Aggregation fails on dataset loaded via `load`

BrandonHaynes commented 9 years ago

As a precondition, load a dataset with the load function:

data = load("file:///mnt/csv", csv(schema(x:int, y:int, z:int)));
store(data, data);

Now execute an aggregation over the resulting relation:

R1 = scan(data);
R2 = [from R1 emit count(*)];
store(R2, result);

This query fails with the exception below. Note that the loaded relation exists only in the catalog of one worker (the one that won the singleton assignment), and it appears that other workers are incorrectly trying to access it.

edu.washington.escience.myria.DbException: Query #256.2 failed: ErrorCode: 0, SQLState: 42P01, Msg: ERROR: relation "public:adhoc:data" does not exist
  Position: 164
    at edu.washington.escience.myria.parallel.MasterSubQuery$WorkerExecutionInfo$2.operationComplete(MasterSubQuery.java:105)
    at edu.washington.escience.myria.parallel.LocalSubQueryFutureListener.operationComplete(LocalSubQueryFutureListener.java:45)
    at edu.washington.escience.myria.util.concurrent.OperationFutureBase.notifyListener(OperationFutureBase.java:583)
    at edu.washington.escience.myria.util.concurrent.OperationFutureBase.notifyListeners(OperationFutureBase.java:542)
    at edu.washington.escience.myria.util.concurrent.OperationFutureBase.wakeupWaitersAndNotifyListeners(OperationFutureBase.java:151)
    at edu.washington.escience.myria.util.concurrent.OperationFutureBase.setFailure0(OperationFutureBase.java:506)
    at edu.washington.escience.myria.parallel.LocalSubQueryFuture.setFailure(LocalSubQueryFuture.java:69)
    at edu.washington.escience.myria.parallel.MasterSubQuery.workerFail(MasterSubQuery.java:320)
    at edu.washington.escience.myria.parallel.QueryManager.workerFailed(QueryManager.java:308)
    at edu.washington.escience.myria.parallel.Server$MessageProcessor.run(Server.java:183)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at edu.washington.escience.myria.util.concurrent.RenamingThreadFactory$1.run(RenamingThreadFactory.java:33)
    Suppressed: edu.washington.escience.myria.DbException: Worker #1 failed: ErrorCode: 0, SQLState: 42P01, Msg: ERROR: relation "public:adhoc:data" does not exist
  Position: 164
        at edu.washington.escience.myria.util.ErrorUtils.mergeSQLException(ErrorUtils.java:60)
        at edu.washington.escience.myria.accessmethod.JdbcAccessMethod.tupleBatchIteratorFromQuery(JdbcAccessMethod.java:240)
        at edu.washington.escience.myria.operator.DbQueryScan.fetchNextReady(DbQueryScan.java:176)
        at edu.washington.escience.myria.operator.Operator.nextReady(Operator.java:336)
        at edu.washington.escience.myria.operator.RootOperator.fetchNextReady(RootOperator.java:120)
        at edu.washington.escience.myria.operator.Operator.nextReady(Operator.java:336)
        at edu.washington.escience.myria.parallel.LocalFragment.executeActually(LocalFragment.java:462)
        at edu.washington.escience.myria.parallel.LocalFragment.access$400(LocalFragment.java:47)
        at edu.washington.escience.myria.parallel.LocalFragment$1.call(LocalFragment.java:199)
        at edu.washington.escience.myria.parallel.LocalFragment$1.call(LocalFragment.java:179)
        at edu.washington.escience.myria.util.concurrent.ExecutableExecutionFuture.call(ExecutableExecutionFuture.java:193)
        ... 4 more
    Caused by: edu.washington.escience.myria.DbException: ErrorCode: 0, SQLState: 42P01, Msg: ERROR: relation "public:adhoc:data" does not exist
  Position: 164
        ... 15 more
    Caused by: org.postgresql.util.PSQLException: ERROR: relation "public:adhoc:data" does not exist
  Position: 164
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2198)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1927)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:549)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:419)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:304)
        at edu.washington.escience.myria.accessmethod.JdbcAccessMethod.tupleBatchIteratorFromQuery(JdbcAccessMethod.java:237)
        ... 13 more
    Suppressed: edu.washington.escience.myria.DbException: Worker #3 failed: ErrorCode: 0, SQLState: 42P01, Msg: ERROR: relation "public:adhoc:data" does not exist
  Position: 164
        at edu.washington.escience.myria.util.ErrorUtils.mergeSQLException(ErrorUtils.java:60)
        at edu.washington.escience.myria.accessmethod.JdbcAccessMethod.tupleBatchIteratorFromQuery(JdbcAccessMethod.java:240)
        at edu.washington.escience.myria.operator.DbQueryScan.fetchNextReady(DbQueryScan.java:176)
        at edu.washington.escience.myria.operator.Operator.nextReady(Operator.java:336)
        at edu.washington.escience.myria.operator.RootOperator.fetchNextReady(RootOperator.java:120)
        at edu.washington.escience.myria.operator.Operator.nextReady(Operator.java:336)
        at edu.washington.escience.myria.parallel.LocalFragment.executeActually(LocalFragment.java:462)
        at edu.washington.escience.myria.parallel.LocalFragment.access$400(LocalFragment.java:47)
        at edu.washington.escience.myria.parallel.LocalFragment$1.call(LocalFragment.java:199)
        at edu.washington.escience.myria.parallel.LocalFragment$1.call(LocalFragment.java:179)
        at edu.washington.escience.myria.util.concurrent.ExecutableExecutionFuture.call(ExecutableExecutionFuture.java:193)
        ... 4 more
    Caused by: edu.washington.escience.myria.DbException: ErrorCode: 0, SQLState: 42P01, Msg: ERROR: relation "public:adhoc:data" does not exist
  Position: 164
        ... 15 more
    Caused by: org.postgresql.util.PSQLException: ERROR: relation "public:adhoc:data" does not exist
  Position: 164
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2198)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1927)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:549)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:419)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:304)
        at edu.washington.escience.myria.accessmethod.JdbcAccessMethod.tupleBatchIteratorFromQuery(JdbcAccessMethod.java:237)
        ... 13 more
    Suppressed: edu.washington.escience.myria.DbException: Worker #4 failed: ErrorCode: 0, SQLState: 42P01, Msg: ERROR: relation "public:adhoc:data" does not exist
  Position: 164
        at edu.washington.escience.myria.util.ErrorUtils.mergeSQLException(ErrorUtils.java:60)
        at edu.washington.escience.myria.accessmethod.JdbcAccessMethod.tupleBatchIteratorFromQuery(JdbcAccessMethod.java:240)
        at edu.washington.escience.myria.operator.DbQueryScan.fetchNextReady(DbQueryScan.java:176)
        at edu.washington.escience.myria.operator.Operator.nextReady(Operator.java:336)
        at edu.washington.escience.myria.operator.RootOperator.fetchNextReady(RootOperator.java:120)
        at edu.washington.escience.myria.operator.Operator.nextReady(Operator.java:336)
        at edu.washington.escience.myria.parallel.LocalFragment.executeActually(LocalFragment.java:462)
        at edu.washington.escience.myria.parallel.LocalFragment.access$400(LocalFragment.java:47)
        at edu.washington.escience.myria.parallel.LocalFragment$1.call(LocalFragment.java:199)
        at edu.washington.escience.myria.parallel.LocalFragment$1.call(LocalFragment.java:179)
        at edu.washington.escience.myria.util.concurrent.ExecutableExecutionFuture.call(ExecutableExecutionFuture.java:193)
        ... 4 more
    Caused by: edu.washington.escience.myria.DbException: ErrorCode: 0, SQLState: 42P01, Msg: ERROR: relation "public:adhoc:data" does not exist
  Position: 164
        ... 15 more
    Caused by: org.postgresql.util.PSQLException: ERROR: relation "public:adhoc:data" does not exist
  Position: 164
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2198)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1927)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:549)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:419)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:304)
        at edu.washington.escience.myria.accessmethod.JdbcAccessMethod.tupleBatchIteratorFromQuery(JdbcAccessMethod.java:237)
        ... 13 more

BrandonHaynes commented 9 years ago

A workaround to this issue is to force a shuffle and store the result:

data = scan(data);
store(data, data, [x]);

senderista commented 8 years ago

@jingjingwang can you take a look at this? Since we're using LOAD() now in all of our example code, this will be hugely confusing for users. Hopefully it's an easy fix...

jingjingwang commented 8 years ago

I took a look; let me explain what happened. There are two operator encodings in MyriaX, TableScanEncoding and DbQueryScanEncoding. The first only scans a table; the second carries a SQL statement that can run an arbitrary query over multiple tables. The query plan generated by raco in this case uses DbQueryScanEncoding to push some optimizations down (I think).

For TableScanEncoding, MyriaX only assigns the operator to workers that store the relation. For DbQueryScanEncoding it's hard to do the same thing, because (1) we would need a SQL parser in MyriaX to get the list of touched relations, but MyriaX is not supposed to do query parsing, and (2) even if we had the list of accessed relations, should we only assign the operator to workers that have all of them?
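The difference between the two placement behaviours can be illustrated with a toy model in Python. This is only a sketch, not MyriaX code: the catalog dictionary, the four-worker cluster, and every function name are hypothetical, and placing the loaded relation on worker 2 is an assumption chosen to match the trace above, where workers 1, 3 and 4 fail.

```python
# Toy model of operator placement; not MyriaX code. All names are hypothetical.

# Catalog: relation name -> set of worker IDs that hold a partition of it.
# Assumption: the relation created by load() lives on worker 2 only.
CATALOG = {"public:adhoc:data": {2}}
ALL_WORKERS = {1, 2, 3, 4}


def workers_for_table_scan(relation):
    """A table scan names its relation, so it can be sent only where the data is."""
    return set(CATALOG[relation])


def workers_for_db_query_scan(sql):
    """An opaque SQL string does not reveal its relations without parsing,
    so in this model the operator is sent to every worker."""
    return set(ALL_WORKERS)


def workers_that_fail(assigned, relation):
    """Workers that would report 'relation does not exist' for this relation."""
    return assigned - CATALOG[relation]


table_scan = workers_for_table_scan("public:adhoc:data")
query_scan = workers_for_db_query_scan('SELECT count(*) FROM "public:adhoc:data"')

print(sorted(workers_that_fail(table_scan, "public:adhoc:data")))  # []
print(sorted(workers_that_fail(query_scan, "public:adhoc:data")))  # [1, 3, 4]
```

In this model the table scan never reaches a worker without the relation, while the query scan reaches three of them.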

PS: if we remove the line `R2 = [from R1 emit count(*)];` then it works, since raco then generates TableScanEncoding instead of DbQueryScanEncoding. It also works if we select the option "Disable pushing computation into database".

jingjingwang commented 8 years ago

So I believe a workaround could be to select "Disable pushing computation into database". If we want to go further, maybe we can add a field to DbQueryScanEncoding for the list of relations touched by its SQL, and it would be great if raco could fill it in. Otherwise we need a SQL parser in MyriaX.

Just realized this: for the parser, we would need to call each DBMS to let it "explain" the query, then parse the result and find all the scans. A singleton parser won't work.
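A minimal Python sketch of the "list of touched relations" idea follows. It is an illustration under stated assumptions, not the actual encoding: `DbQueryScanSketch`, `touched_relations`, and `eligible_workers` are hypothetical names, and assigning the operator only to workers that hold every touched relation is just one possible answer to the open question raised earlier in the thread.

```python
from dataclasses import dataclass, field


@dataclass
class DbQueryScanSketch:
    """Hypothetical stand-in for DbQueryScanEncoding with the proposed extra field."""
    sql: str
    # Proposed field: relations the SQL touches, filled in by the planner (raco).
    touched_relations: list = field(default_factory=list)


def eligible_workers(op, catalog, all_workers):
    """Return the workers that hold every touched relation (one possible policy).

    Falls back to all workers when the list is empty, i.e. when the planner
    did not fill the field in.
    """
    if not op.touched_relations:
        return set(all_workers)
    eligible = set(all_workers)
    for relation in op.touched_relations:
        # A relation missing from the catalog is held by no worker.
        eligible &= catalog.get(relation, set())
    return eligible


# Assumed catalog: one loaded relation on worker 2, one relation on three workers.
catalog = {"public:adhoc:data": {2}, "public:adhoc:other": {1, 2, 3}}
op = DbQueryScanSketch(
    sql='SELECT count(*) FROM "public:adhoc:data"',
    touched_relations=["public:adhoc:data"],
)
print(eligible_workers(op, catalog, {1, 2, 3, 4}))  # {2}
```

Because the planner supplies the list, the execution engine needs no SQL parser of its own in this sketch.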

BrandonHaynes commented 8 years ago

Is creating an empty relation on all non-load workers not a potential workaround?

jingjingwang commented 8 years ago

I think that would also be a workaround (although it doesn't sound quite right, since those workers shouldn't get the query in the first place).

senderista commented 8 years ago

Yes, I'm concerned about that approach corrupting the semantics of our metadata. We would no longer be able to assume that a relation partitioned on a given worker actually had data on that worker without checking the size of the partition.
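The concern can be made concrete with a small Python sketch. The metadata layout here (relation name mapped to per-worker tuple counts) and the numbers in it are hypothetical, chosen only to show why "listed on a worker" and "has data on a worker" would stop being the same thing.

```python
# Hypothetical metadata after creating empty placeholder relations:
# relation -> {worker ID: number of tuples stored on that worker}.
partitions = {"public:adhoc:data": {1: 0, 2: 1000, 3: 0, 4: 0}}


def workers_listed(relation):
    """Workers the metadata claims hold a partition of the relation."""
    return set(partitions[relation])


def workers_with_data(relation):
    """Workers whose partition is non-empty; requires checking each size."""
    return {w for w, count in partitions[relation].items() if count > 0}


print(sorted(workers_listed("public:adhoc:data")))     # [1, 2, 3, 4]
print(sorted(workers_with_data("public:adhoc:data")))  # [2]
```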

It sounds like fixing this before the demo isn't realistic and we should pursue one of these workarounds:

1) Insert a superfluous partitioning expression into all store() statements (and continually remind users to do this)
2) Change myria-web to make "Disable pushing computation into database" the default

Unless 2) significantly slows a lot of queries, I lean toward 2). Even if it does slow down some queries, existing queries will still be correct when we change the myria-web default, and we should be able to do that pretty soon. What do the rest of you think?

jingjingwang commented 8 years ago

Yep, I also vote for 2) for the demo. I think we will not be demonstrating big queries anyway. Although we need to check if they can still pass with 2).

uwescience / myria #758