locationtech-labs / geopyspark

GeoTrellis for PySpark
Other
179 stars 59 forks source link

SpaceTimePartitionStrategy #643

Closed jbouffard closed 6 years ago

jbouffard commented 6 years ago

This PR adds the SpaceTimePartitionStrategy to to the list of PartitionStrategies. This will allow for better partitioning of SPACETIME layers. The SpaceTimePartitioner will try to sort each element of the layer based on its spatial location and its time. Temporal sorting can be controlled both by a unit of time (ie. DAYS, MONTHS, etc) and by a resolution (ie. 1 day, 3 weeks, etc).

This PR resolves #642

jdries commented 6 years ago

This looks as expected. For me, the most difficult parameter to use is 'bits' and it's also not entirely clear to me how it relates to num_partitions. Of course, this parameter also exists in plain geotrellis, so it makes sense to also expose it here.

jbouffard commented 6 years ago

@jdries The bits parameter is kind of confusing, and it's something we should explain better in the documentation.

In order to determine what partition an element belongs to, we first calculate its z-value using a Z-order curve which in this case is the interleaving of the binary values of the col, row, and instant of the Tile. This z-value represents an area that is unique to those values. However, this is a problem as we'd like to group values who are close to each other both spatially and temporally in the same partition, and it's difficult to achieve this when all of the z-values are unique. Thus, we perform a right shift on the resulting z-value https://github.com/locationtech-labs/geopyspark/blob/master/geopyspark-backend/geotrellis/src/main/scala/geopyspark/geotrellis/SpatialPartitioner.scala#L21

By dropping N number of bits from each z-value, we can create larger groupings of values since those elements that are near each other will have their remaining bits in common. Therefore, the larger bits is, the larger the groupings of values will be. bits controls how many elements will (ideally) be in each partition.

The shifted z-index is used to assign its associated value to a partition based on num_partitions by doing z-index % num_partitions. Which will try to distribute groupings evenly between the partitions.

I hope I was able to answer your question, but please let me know if it wasn't clear. @echeipesh Do you have anything you'd like to add?

jbouffard commented 6 years ago

@jpolchlo I updated the docstring. How does it look now?

jpolchlo commented 6 years ago

I feel as if there are more words there, but I'm only getting a slightly better understanding of what the parameter is there for. I'm a neophyte in regards to spatial partitioning, but not so dumb that I shouldn't be able figure it out with a few good sentences. Is there a way to describe why I would change the bits parameter? What would the tradeoffs be? Can you impart any intuition about this? (Viz. the connection between the z-curve and the quadtree?)

jamesmcclain commented 6 years ago

the connection between the z-curve and the quadtree?

I am typing this without looking at the underlying code in question, but typically you would expect one bit per spatial dimension per division. In the 2D case, the number of bits divided by two is the number of divisions (levels of the quadtree) that are payed-attention-to.

jbouffard commented 6 years ago

I'm having some trouble explaining what bits does in a concise manner, so I'm thinking about just merging this PR and then make an issue for it. Would that be okay with you @jpolchlo?

It might also be good to make the same issue in GT since I have to yet find any documentation there that explains what bits does well.

jpolchlo commented 6 years ago

I'm ok with that. Make an issue!