uwescience / myria

Myria is a scalable Analytics-as-a-Service platform based on relational algebra.
myria.cs.washington.edu

Partitioning error during ingest unioned with an existing relation #845

Open dhutchis opened 8 years ago

dhutchis commented 8 years ago

@shrjain and I found that the following query on the example TwitterK dataset fails. (If the IP address of the demo Myria service changes, substitute the new one, or pick another dataset that you can load from either the local filesystem or a URL.) The query is simplified from a real one we are running on genomic data.

A = load("http://54.235.58.81:8753/dataset/user-public/program-adhoc/relation-TwitterK/data?format=csv", csv(schema(a:int, b:int),skip=1));
store(A, tmpA, [b]);

A = scan(tmpA);
B = load("http://54.235.58.81:8753/dataset/user-public/program-adhoc/relation-TwitterK/data?format=csv", csv(schema(a:int, b:int),skip=1));
B = B + A;
store(B, tmpB, [b]);

The query fails with an exception thrown on this line of QueryConstruct.

The line checks that every FileScan operator sits inside a fragment that is assigned to no more than one worker. In this query, however, the FileScan is placed inside a fragment that is assigned to more than one worker, because that fragment also contains the logic for the UnionAll.
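To make the failing condition concrete, here is a minimal Python sketch of the kind of check described above. It is not Myria's actual code: the `Fragment` class, the `check_file_scans` function, and the worker IDs are hypothetical names invented for illustration, and the real check lives in the Java class QueryConstruct.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Fragment:
    """Hypothetical stand-in for a plan fragment: operator names plus assigned workers."""
    operators: List[str]
    workers: Set[int] = field(default_factory=set)


def check_file_scans(fragments: List[Fragment]) -> None:
    """Raise ValueError if a fragment containing a FileScan has more than one worker."""
    for index, fragment in enumerate(fragments):
        if "FileScan" in fragment.operators and len(fragment.workers) > 1:
            raise ValueError(
                f"fragment {index} contains a FileScan but is assigned to "
                f"{len(fragment.workers)} workers"
            )


# A FileScan alone in a single-worker fragment passes the check.
check_file_scans([Fragment(["FileScan"], {1})])

# Assumed shape of the failing plan: the FileScan shares a fragment with the
# UnionAll, and that fragment is assigned to several workers.
try:
    check_file_scans([Fragment(["FileScan", "UnionAll"], {1, 2})])
except ValueError as error:
    print("rejected:", error)
```

Under these assumptions, the check itself behaves as intended; the question is why fragment assignment puts the FileScan and the UnionAll into the same multi-worker fragment.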

The problem is likely in the code that assigns fragments to workers. It goes away if you remove the partitioning argument [b] from the two store statements, so the hash partitioning appears to be interfering with the assignment.

senderista commented 8 years ago

Removing the hash partitioning works because, by default, STORE used with LOAD partitions a relation onto a single worker (this is a bug: https://github.com/uwescience/raco/issues/494).

senderista commented 8 years ago

Interestingly, running this query on the demo server seems to hang forever (in the attached screenshot it has been running for over 2 hours):

image