Spatial Sort for VectorJoin

echeipesh commented 7 years ago

VectorJoin promises ability to do a spatial predicate join over two large RDDs of geometries. But it has pre-condition that the geometries in either RDDs are spatially correlated per partition. That is records in each partition are reasonably clustered together. When this is not the case it devolves into a partition-wise cross-product between two RDDs.

Often this pre-condition will be met incidentally, for instance when loading the RDDs from a distributed database that has a spatial index, or if the file read order happens to impart spatial correlation due to naming. However, when this is not the case there is currently no reasonable way to spatially sort the records.

Ideally some sort of order can be applied without iterating over the records.

A possible solution may be to implement a spark Partitioner that essentially implements a simple version of a tiered geospatial index, like XZ-order curve used one used by GeoWave or GeoMesa.

Where partitions would represent sequenced ties of a quad-tree.

Partition 0 would container geometries covering world extent
Partitions 1-4 would contain geometries covering a quadrant at pyramid level 1
Partitions 5-21 would contain geometries covering a quadrant at pyramid level 2
etc

Partition 0 would likely be interested with every partition on opposing side but more resolute partitions can be increasingly more selective. Empty partitions would certainly result, but their existence does not dramatically impact performance if logging is tuned correctly.

As long as its possible to draw a bounding box around the dataset and see a reasonably uniform density of geometries, this approach should work well. Dealing with very skewed datasets is a harder problem (imagine a dataset with two clusters at opposite sides of the bounding box).

References: XZ-Order curve used by GeoMesa: https://github.com/locationtech/geomesa/tree/master/geomesa-z3/src/main/scala/org/locationtech/geomesa/curve http://www.dbs.ifi.lmu.de/Publikationen/Papers/SSD-XZ-Order.final.pdf

TieredSFCIndex used by GeoWave: https://github.com/ngageoint/geowave/blob/master/core/index/src/main/java/mil/nga/giat/geowave/core/index/sfc/tiered/TieredSFCIndexStrategy.java

If either of these implementations are suitable it should be considered if they should be added to https://github.com/locationtech/sfcurve and brought in as a dependency in that manner.

lossyrob commented 7 years ago

Related to #2116

lossyrob commented 7 years ago

Relevant https://github.com/locationtech/geomesa/pull/1612

locationtech / geotrellis

Spatial Sort for VectorJoin #2136