locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io
Other
1.34k stars 360 forks source link

Spatially partition RDDs of vector data #2116

Open lossyrob opened 7 years ago

lossyrob commented 7 years ago

We have a VectorJoin which can efficiently join RDDs of vectors, with the caveat that the RDDs must be spatially partitioned. Currently, we don't have a very good way to do that. This issue is solved when we have a good spark Partitioner that can spatially partition RDDs of Geometries or Feature data, such that VectorJoin is as efficient as possible.

lossyrob commented 7 years ago

To clarify my thinking on this:

If partitions of geometries are not spatially partitioned, then the extent of any one partition is likely to be a large percentage of the extent of the dataset as a whole. The line here will tend to always be true as the partition extents tend to overlap each other, for the metapred that the VectorJoin uses, defined here.. This will cause the number of partitions shuffled to tends toward a complete shuffle (like a full cross join), i.e. shuffling each partition to each node in the cluster. So for an N node cluster, we send at most, and tending towards, N-1 copies of the data across the network of the cluster for this join.

A spatial partition of the RDD should result in shuffling each piece of data to at most 1 node. There should not be multiple copies of data elements to multiple nodes; either the geometries would stay on the node they happen to reside prior to the spatial partitioning, or they would be sent to a new partition that resides on another node. After that single shuffle, the join should be much more selective in what partitions it shuffles.

If we have two RDDs that are partitioned with the same partitioner, then we can guarantee that the join will shuffle no partitions. So for each partition step, we would be shuffling at most 1 copy of the data, and the join would shuffle nothing - this is as opposed to shuffling potentially N-1 copies of the data across the wire for on of the RDDs during the join.

There are potential occasions where either the geometry data is not uniformly distributed (in which case the join happens to not tend towards N-1 shuffle copies), or the RDD that is being shuffled has significantly less data than the other. For the case where both RDDs are of relatively equal size, and we can assume a random distribution of geometries across partitions, then for the reasons given above I am lead to believe the partition-before-join approach will be optimal.

@jamesmcclain let me know if this reasoning make sense, or if there's a flaw in my thinking.

jamesmcclain commented 7 years ago

Okay, sounds good.

lossyrob commented 6 years ago

Moving to 2.0; ClipToGrid gives partial functionality in the 1.2 release