Open senderista opened 8 years ago
Elaborating on the sample median approach above, I think this could be implemented with the MyriaL equivalent of SampleScan(k)->OrderBy()->Limit(k/2)->OrderBy(DESC)->Limit(1)
:
const sample_size: 1000;
P_X = [from points emit x];
S_X = samplescan(P_X, sample_size, WoR);
M = select M_1.x from (
select x from S_X
order by x
limit sample_size/2) M_1
order by x DESC
limit 1;
Or something like that ;-)
We have a case (the
cosmo8
dataset) where a spatial dataset is extremely skewed (always hashes to a single partition), making queries very difficult. If we had a MyriaL query that applied a balanced spatial partitioning technique like kd-trees or octrees, we would be able to pre-partition the data to preserve locality while avoiding skew. (If necessary, this logic could later be implemented as a standalone operator.)We should be able to implement this query as a
do...while
loop, with the termination condition expressed as point density or fraction of total points. We could initially implement the partitioning as repeated bisection of the space, cycling through each dimension as in kd-trees, but simply bisecting along the midpoint coordinates of each cell. Later we could turn this into a proper kd-tree, bisecting along the median coordinates in each dimension (possibly just using the median of a random sample ofk
points rather than trying to run anO(n)
distributed median algorithm).