locationtech-labs / geopyspark

GeoTrellis for PySpark
Other
179 stars 59 forks source link

Use geotrellis-contrib to perform windowed reads on remote rasters #689

Open echeipesh opened 5 years ago

echeipesh commented 5 years ago

A typical GPS workflow might start like this:

pats = ["s3://bucket/one.tiff"]
raster_layer = gps.RasterLayer.from_numpy_rdd(gps.LayerType.SPATIAL, gps.rasterio.get(paths))
tiled_layer = raster_layer.tile_to_layout(gps.GlobalLayout(256), target_crs=3857)
tiled_layer.count()

This is great for various reasons but has following cost:

Prototyped here is an alternative workflow that avoids the high cost of spark shuffles on pixel tiles in this process: https://github.com/geotrellis/geotrellis-contrib/blob/demo/wsel/wse/src/main/scala/Main.scala

This can be encapsulated in a GeoPySpark operation that combines the read and tile_to_layout (with optional reprojection) steps to produce a TiledRasterLayer.

Ideally API can be similar to tile_to_layout perhaps:

    def read_to_layout(paths,
                       layout=LocalLayout(),
                       target_crs=None,
                       resample_method=ResampleMethod.NEAREST_NEIGHBOR,
                       partition_strategy=None,
                       reader=ReadMethod.GEOTRELLIS):

There is an outstanding question of what should happen to merge order. I think currently any overlapping raster is going to get merged in arbitrary order as result of tile_to_layout operation and that would be an easy default to start from here.

jbouffard commented 5 years ago

This gist contains some more mock Python API/logic for the read_to_layout function. One issue that was raised when writing this code was with geotrellis-contrib's inability to create SpaceTime RDDs; as the current GPS API allows users to specify the LayerType when reading in data. @echeipesh do you think it'd be worth refactoring the geotrellis-contrib code so that it can work on a generic K? Or should we just implement separate logic to deal with that case?

With the read_to_layout method, we'll have three different ways of reading in raw geospatial data into an RDD:

Not Tiled:

Tiled:

I think it'd make sense then to combine gps.geotiff.get and gps.rasterio.get into one function. By having one function, we'll decrease the surface area of GPS' API and simplify workflows. This could also be called, get and it can reside in gps.raster_source. The user could switch between GT and GDAL reading via the read_method parameter.

The one issue with this from an API perspective is the number of parameter's gps.geotiff.get has. To get around this, we could place them in a namedtuple:


class GeoTiffOptions(namedtuple("GeoTiffOptions", 'chunk_size partition_bytes time_tag time_format delimiter s3_client s3_credentials')):
    __slots__ = []

And then pass in those options as a parameter to get


def get(layer_type,
        paths,
        tile_size=MAX_TILE_SIZE,
        target_crs=None,
        geotrellis_options=GeoTiffOptions(),
        read_method=ReaderMethod.GEOTRELLIS)
echeipesh commented 5 years ago
@echeipesh do you think it'd be worth refactoring the geotrellis-contrib code so that it can work on a generic K? 

Yes, that's one of the next steps in that project but its probably going to take a little bit to design that API. I don't think lack of that capability should block this feature. For now I think it would be acceptable tor throw NotImplementedError if asked for temporal layer.

I think it'd make sense then to combine gps.geotiff.get and gps.rasterio.get into one function.

Agreed that it makes sense to group this functionality by tiled and not-tiled use case. But grouping rasterio and GeoTrellis read methods could potentially be more confusing. You're going to end up with one hell of a doc-string on the function trying to explain differences in behavior. Also the ways in which either method is going to fail and needs to be configured is going to be significantly different. rasterio is also an optional dependency and at least having it in its own module gives people some way to reason when they've need to have installed it without being surprised by an module not found error. You may have better sense of expectations here but it doesn't feel like it would simplify the the API that much for all the questions it introduces.

Maybe another way is to work towards unifying gps.geotiff.get and read_to_layout? In the end gps.geotiff.get will likely end up being implemented using RasterSources as well and then it would be the difference between asking for tiled and untiled input.

Also another differences is that gps.geotiff.get actually performs the listing where as read_to_layout and rasterio.get accept a list of URIs which must have been discovered beforehand.