harsha2010 / magellan

Geo Spatial Data Analytics on Spark
Apache License 2.0

Better Spatial Join Implementation for Magellan #208

Open harsha2010 opened 6 years ago

harsha2010 commented 6 years ago

The ZOrderCurve index computed by Magellan today subdivides space into equal-sized grids of a given precision. This can easily be changed to compute curves of a precision between a min and a max (or to satisfy other requirements, like "I want at most n curves taken together to contain the geometry"). The advantage of doing this is that the index is easy to compute and fits into memory on a single node more easily. The disadvantage is that we can no longer use the underlying join implementations in Spark at the physical plan layer. For example, in a point-within-polygon computation, we'd end up with a single curve of precision, say 001100, for the point, but maybe a couple of curves of the form 001 and 0100 for the polygon.
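To make the mismatch concrete, here is a minimal sketch (in Python, purely illustrative, not Magellan's actual implementation) of a Z-order curve id produced by interleaving the bits of the x and y cell coordinates at a chosen precision. The function name and the coordinate range are assumptions for the example; the key property it demonstrates is that a coarse curve for a region is a prefix of the fine curve of any point inside it:

```python
def zorder_curve(x, y, precision, lo=-180.0, hi=180.0):
    """Return the Z-order curve id of (x, y) as a bit string.

    `precision` is the number of bits per dimension, so the curve id has
    2 * precision bits. Coordinates are assumed to lie in [lo, hi).
    """
    cells = 1 << precision
    # Map each coordinate to an integer cell index at this precision.
    cx = int((x - lo) / (hi - lo) * cells)
    cy = int((y - lo) / (hi - lo) * cells)
    bits = []
    # Interleave the bits of cx and cy, most significant bit first.
    for i in range(precision - 1, -1, -1):
        bits.append(str((cx >> i) & 1))
        bits.append(str((cy >> i) & 1))
    return "".join(bits)

# The same point encoded at a coarser precision yields a strict prefix
# of its finer curve -- this prefix relationship, not equality, is what
# a join between point curves and polygon curves has to exploit.
fine = zorder_curve(-100.0, 40.0, 3)    # "010010"
coarse = zorder_curve(-100.0, 40.0, 2)  # "0100"
assert fine.startswith(coarse)
```

Because a point's fine curve and a polygon's coarser covering curves are related by prefix rather than by equality, an equality-keyed join cannot match them directly.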

A hash join would no longer work, since the keys are unequal-length prefixes rather than exact matches. Instead we'd need to construct a trie and walk down that data structure to figure out where to send each point. If the trie is small enough, this is easy; but if the trie is bigger, we need the trie to be distributed, along with a partitioning scheme that can be efficient. We also need logic at the physical layer to determine whether the trie should be distributed or localized on the driver (i.e., a shuffled or streaming join).
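The trie walk described above can be sketched as follows (illustrative Python; the class and the polygon ids are hypothetical, not Magellan's API). The trie is built over the polygons' curve prefixes; looking up a point's fine curve collects every polygon whose curve is a prefix of it, which is exactly the candidate set the point should be routed to:

```python
class CurveTrie:
    """Binary trie over Z-order curve bit strings."""

    def __init__(self):
        self.children = {}   # bit ("0"/"1") -> child CurveTrie
        self.values = []     # polygon ids whose curve ends at this node

    def insert(self, curve, polygon_id):
        node = self
        for bit in curve:
            node = node.children.setdefault(bit, CurveTrie())
        node.values.append(polygon_id)

    def lookup(self, point_curve):
        """Collect ids of every polygon whose curve is a prefix of the
        point's curve -- the join candidates for this point."""
        node, candidates = self, []
        for bit in point_curve:
            candidates.extend(node.values)
            node = node.children.get(bit)
            if node is None:
                return candidates
        candidates.extend(node.values)
        return candidates

# Hypothetical example: polygon A is covered by curves 001 and 0100,
# polygon B by curve 11, as in the point-within-polygon discussion above.
trie = CurveTrie()
trie.insert("001", "polygonA")
trie.insert("0100", "polygonA")
trie.insert("11", "polygonB")

# A point with fine curve 001100 shares the prefix 001 with polygon A,
# so only A survives as a candidate for the exact containment check.
candidates = trie.lookup("001100")  # ["polygonA"]
```

The candidate set still has to be verified with an exact point-in-polygon test; the trie only prunes the pairs that get shuffled to it, which is why its size (and whether it fits on the driver) matters at the physical plan layer.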

All of this makes the implementation a lot more complex, so it has been lower priority.

The main reasons to implement a better join would be: