harsha2010 / magellan

Geo Spatial Data Analytics on Spark
Apache License 2.0

Better Spatial Join Implementation for Magellan #208

Open harsha2010 opened 6 years ago

harsha2010 commented 6 years ago

The ZOrderCurve index computed by Magellan today subdivides space into equal-sized grids of a given precision. This can easily be changed to compute curves of a precision between a min and a max (or to satisfy other requirements, like "I want at most n curves taken together to contain the geometry"). The advantage of doing this is that the index is easy to compute and fits into memory on a single node more easily. The disadvantage is that we can no longer use the underlying join implementations in Spark at the physical plan layer. For example, in a point-within-polygon computation, we'd end up with a single curve of precision, say 001100, for the point, but maybe a couple of curves of the form 001 and 0100 for the polygon.
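To make the mismatch concrete, here is a minimal sketch (in Python, purely illustrative, not Magellan's actual implementation) of a Z-order curve id produced by interleaving the bits of the x and y cell coordinates at a chosen precision. The function name and the coordinate range are assumptions for the example; the key property it demonstrates is that a coarse curve for a region is a prefix of the fine curve of any point inside it:

```python
def zorder_curve(x, y, precision, lo=-180.0, hi=180.0):
    """Return the Z-order curve id of (x, y) as a bit string.

    `precision` is the number of bits per dimension, so the curve id has
    2 * precision bits. Coordinates are assumed to lie in [lo, hi).
    """
    cells = 1 << precision
    # Map each coordinate to an integer cell index at this precision.
    cx = int((x - lo) / (hi - lo) * cells)
    cy = int((y - lo) / (hi - lo) * cells)
    bits = []
    # Interleave the bits of cx and cy, most significant bit first.
    for i in range(precision - 1, -1, -1):
        bits.append(str((cx >> i) & 1))
        bits.append(str((cy >> i) & 1))
    return "".join(bits)

# The same point encoded at a coarser precision yields a strict prefix
# of its finer curve -- this prefix relationship, not equality, is what
# a join between point curves and polygon curves has to exploit.
fine = zorder_curve(-100.0, 40.0, 3)    # "010010"
coarse = zorder_curve(-100.0, 40.0, 2)  # "0100"
assert fine.startswith(coarse)
```

Because a point's fine curve and a polygon's coarser covering curves are related by prefix rather than by equality, an equality-keyed join cannot match them directly.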

A hash join would no longer work, since the keys are unequal-length prefixes rather than exact matches. Instead we'd need to construct a trie and walk down that data structure to figure out where to send each point. If the trie is small enough, this is easy; but if the trie is bigger, we need the trie to be distributed, along with a partitioning scheme that can be efficient. We also need logic at the physical layer to determine whether the trie should be distributed or localized on the driver (i.e., a shuffled or streaming join).
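The trie walk described above can be sketched as follows (illustrative Python; the class and the polygon ids are hypothetical, not Magellan's API). The trie is built over the polygons' curve prefixes; looking up a point's fine curve collects every polygon whose curve is a prefix of it, which is exactly the candidate set the point should be routed to:

```python
class CurveTrie:
    """Binary trie over Z-order curve bit strings."""

    def __init__(self):
        self.children = {}   # bit ("0"/"1") -> child CurveTrie
        self.values = []     # polygon ids whose curve ends at this node

    def insert(self, curve, polygon_id):
        node = self
        for bit in curve:
            node = node.children.setdefault(bit, CurveTrie())
        node.values.append(polygon_id)

    def lookup(self, point_curve):
        """Collect ids of every polygon whose curve is a prefix of the
        point's curve -- the join candidates for this point."""
        node, candidates = self, []
        for bit in point_curve:
            candidates.extend(node.values)
            node = node.children.get(bit)
            if node is None:
                return candidates
        candidates.extend(node.values)
        return candidates

# Hypothetical example: polygon A is covered by curves 001 and 0100,
# polygon B by curve 11, as in the point-within-polygon discussion above.
trie = CurveTrie()
trie.insert("001", "polygonA")
trie.insert("0100", "polygonA")
trie.insert("11", "polygonB")

# A point with fine curve 001100 shares the prefix 001 with polygon A,
# so only A survives as a candidate for the exact containment check.
candidates = trie.lookup("001100")  # ["polygonA"]
```

The candidate set still has to be verified with an exact point-in-polygon test; the trie only prunes the pairs that get shuffled to it, which is why its size (and whether it fits on the driver) matters at the physical plan layer.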

All of this makes the implementation a lot more complex, so it has been lower priority.

The main reasons to implement a better join would be: