uwescience / myria

Myria is a scalable Analytics-as-a-Service platform based on relational algebra.
myria.cs.washington.edu
Other
113 stars 46 forks source link

Write balanced spatial partitioning query in MyriaL #813

Open senderista opened 8 years ago

senderista commented 8 years ago

We have a case (the cosmo8 dataset) where a spatial dataset is extremely skewed (always hashes to a single partition), making queries very difficult. If we had a MyriaL query that applied a balanced spatial partitioning technique like kd-trees or octrees, we would be able to pre-partition the data to preserve locality while avoiding skew. (If necessary, this logic could later be implemented as a standalone operator.)

We should be able to implement this query as a do...while loop, with the termination condition expressed as point density or fraction of total points. We could initially implement the partitioning as repeated bisection of the space, cycling through each dimension as in kd-trees, but simply bisecting along the midpoint coordinates of each cell. Later we could turn this into a proper kd-tree, bisecting along the median coordinates in each dimension (possibly just using the median of a random sample of k points rather than trying to run an O(n) distributed median algorithm).

senderista commented 8 years ago

Elaborating on the sample median approach above, I think this could be implemented with the MyriaL equivalent of SampleScan(k)->OrderBy()->Limit(k/2)->OrderBy(DESC)->Limit(1):

const sample_size: 1000;
P_X = [from points emit x];
S_X = samplescan(P_X, sample_size, WoR);
M = select M_1.x from (
  select x from S_X
  order by x
  limit sample_size/2) M_1
order by x DESC
limit 1;

Or something like that ;-)