Virtual Layer Mosaic

echeipesh commented 6 years ago

To combine multiple raster sources into a single output, their pixel values must be co-registered; that is, they must share:

geospatial projection
spatial resolution
tiling scheme
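As a minimal sketch of the requirement above, the check below treats two sources as co-registered when all three properties match exactly. `GridInfo` and its fields are hypothetical stand-ins, not GeoTrellis types, and grid origin alignment is deliberately left out.

```scala
object CoRegistration {
  // Hypothetical, simplified description of a raster's grid; not a GeoTrellis type.
  final case class GridInfo(
    crs: String,        // geospatial projection, e.g. an EPSG code such as "EPSG:3857"
    cellWidth: Double,  // spatial resolution in CRS units per pixel
    cellHeight: Double,
    tileCols: Int,      // tiling scheme: pixels per tile
    tileRows: Int
  )

  /** True when every source shares projection, resolution and tiling scheme.
    * Uses exact equality; a real check would likely compare resolutions
    * with a tolerance. */
  def coRegistered(sources: Seq[GridInfo]): Boolean =
    sources.distinct.size <= 1
}
```

For example, a 10 m source and a 30 m source in the same projection would fail this check and would need resampling before being combined.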

Typically this requirement is handled by an Extract Transform Load (ETL) step, which transforms the input rasters to a shared layout that defines the above parameters. While this approach is valid and common, it makes too many arbitrary decisions before the reading step starts.

The single ETL process should be broken up into separate stages, which allow incremental refinement and modification of the process based on metadata.

A better approach is to query or read only the raster metadata and associate it with the source of the raster cells in a RasterRef (working name). This would allow splitting, filtering and joining on the metadata records, and reading the cells only once the shape and content of the final output are known.

The intermediate data structure represents something like a virtual mosaic, similar to a GDAL VRT, that can be further refined in Spark memory before being reified.
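One way to picture the idea is a record that holds metadata eagerly and the cells lazily. This is only a sketch under assumptions: `RasterMeta`, the `load` function and the `reads` counter are invented for illustration, and the actual RasterRef design is still open in this issue.

```scala
object RasterRefSketch {
  // Hypothetical metadata record; field names are illustrative only.
  final case class RasterMeta(uri: String, cols: Int, rows: Int, bandCount: Int)

  /** Metadata plus a deferred way to obtain the cells. */
  final class RasterRef(val meta: RasterMeta, load: () => Array[Int]) {
    // Cells are fetched on first access only and then cached.
    lazy val cells: Array[Int] = load()
  }

  // Counts how often the stand-in reader runs, to make the laziness visible.
  var reads = 0

  def ref(uri: String, cols: Int, rows: Int): RasterRef =
    new RasterRef(RasterMeta(uri, cols, rows, 1), () => {
      reads += 1
      Array.fill(cols * rows)(0) // stand-in for real I/O
    })
}
```

Filtering a collection of such references on `meta` leaves `reads` at zero; the counter only moves when `cells` is first touched on a surviving reference.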

The expected stages to support this feature are:

Query

Identify potential rasters and optionally read their metadata if it is not given in the query response.

Design Question

What allows the RasterRef to carry user-defined metadata values? Is it a type parameter M or something else?

Filter

Remove or modify the raster references based on metadata.
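The sketch below shows one possible shape for this stage, taking the type parameter M option from the design question above. `SceneMeta`, `mapMeta` and `clearScenes` are hypothetical names chosen for illustration; nothing here is an existing GeoTrellis API.

```scala
object FilterSketch {
  /** Reference to a raster source carrying user-defined metadata of type M. */
  final case class RasterRef[M](uri: String, meta: M) {
    /** Modify the metadata without touching any raster cells. */
    def mapMeta[N](f: M => N): RasterRef[N] = RasterRef(uri, f(meta))
  }

  // Hypothetical user-defined metadata.
  final case class SceneMeta(cloudCover: Double, year: Int)

  /** Remove references whose scenes are cloudier than maxCloud. */
  def clearScenes(
    refs: Seq[RasterRef[SceneMeta]],
    maxCloud: Double
  ): Seq[RasterRef[SceneMeta]] =
    refs.filter(_.meta.cloudCover <= maxCloud)
}
```

With a type parameter, filtering is ordinary collection filtering and "modify" is a map over the metadata; the cost is that M has to be threaded through every later stage.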

Split

Partition references to larger rasters into sets of possibly overlapping windows. In this step we may end up with records that reference different regions of the same file or table.

Windows may overlap if it is desirable to read a buffer of pixels around a region, allowing that region to be correctly resampled or reprojected.
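A minimal sketch of the splitting step, assuming square tiles and a uniform pixel buffer: each window is grown by `buffer` pixels on every side and clamped to the raster, so neighbouring windows overlap. `Window` and `split` are hypothetical names, not GeoTrellis types.

```scala
object SplitSketch {
  /** Pixel window into a source raster; max bounds are exclusive. */
  final case class Window(colMin: Int, rowMin: Int, colMax: Int, rowMax: Int) {
    def cols: Int = colMax - colMin
    def rows: Int = rowMax - rowMin
  }

  /** Split a cols x rows raster into tileSize x tileSize windows, each grown
    * by `buffer` pixels on every side and clamped to the raster extent. */
  def split(cols: Int, rows: Int, tileSize: Int, buffer: Int): Seq[Window] =
    for {
      r <- 0 until rows by tileSize
      c <- 0 until cols by tileSize
    } yield Window(
      math.max(0, c - buffer),
      math.max(0, r - buffer),
      math.min(cols, c + tileSize + buffer),
      math.min(rows, r + tileSize + buffer)
    )
}
```

For a 512 x 512 raster with 256-pixel tiles and an 8-pixel buffer this yields four windows; the first covers columns and rows 0 to 264, so it shares a 16-pixel strip with each of its neighbours.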

Join

Once the raster refs have been split into windows according to some LayoutDefinition, they can be keyed and joined on those keys.

The intent of the join is either to assemble different rasters into a MultibandTile or to select a subset of the sources based on metadata.
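To illustrate the first intent, the sketch below performs an inner join of per-band windows on their keys, using plain Scala collections rather than Spark. `Key` is a stand-in for SpatialKey, and `KeyedWindow` and `joinBands` are hypothetical names.

```scala
object JoinSketch {
  type Key = (Int, Int) // (col, row) in a shared layout; stand-in for SpatialKey

  /** A window of one source, already keyed by the layout cell it fills. */
  final case class KeyedWindow(key: Key, source: String)

  /** Inner join: keep only keys present in every band and return the
    * windows for each key in band order. If one band holds several windows
    * for the same key, the last one wins in this simplified version. */
  def joinBands(bands: Seq[Seq[KeyedWindow]]): Map[Key, Seq[KeyedWindow]] = {
    val perBand: Seq[Map[Key, KeyedWindow]] =
      bands.map(_.map(w => w.key -> w).toMap)
    val shared: Set[Key] =
      if (perBand.isEmpty) Set.empty
      else perBand.map(_.keySet).reduce(_ intersect _)
    shared.map(k => k -> perBand.map(_(k))).toMap
  }
}
```

Each resulting entry lists, per key, the windows that would later be read and stacked as bands; no cells are touched at this point.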

Design Questions

Does the user have to keep track of the constraint that keys are from the same LayoutDefinition, or is there some class that can enforce this per result?
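One possible answer, sketched here as an assumption rather than a proposal from this thread, is a wrapper that remembers which layout produced its keys and refuses to join across layouts at runtime. `Layout` is a simplified stand-in for LayoutDefinition and `Keyed` is a hypothetical class.

```scala
object LayoutGuard {
  // Hypothetical, simplified stand-in for a LayoutDefinition.
  final case class Layout(crs: String, cellSize: Double, tileSize: Int)

  /** Keyed values that remember which layout produced their keys. */
  final case class Keyed[V](layout: Layout, values: Map[(Int, Int), V]) {
    /** Inner join; fails fast if the two sides were keyed by different layouts. */
    def join[W](other: Keyed[W]): Keyed[(V, W)] = {
      require(layout == other.layout, s"layout mismatch: $layout vs ${other.layout}")
      val pairs: Map[(Int, Int), (V, W)] = values.collect {
        case (k, v) if other.values.contains(k) => k -> ((v, other.values(k)))
      }
      Keyed(layout, pairs)
    }
  }
}
```

This moves the bookkeeping from the user to the result type, at the price of a runtime check; a compile-time guarantee would need something stronger, such as tying the key type to the layout.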

Read

Read the raster cells described by each window of the RasterRef into a MultibandTile.

Reading the raster cells has room for optimization. In the common case of GeoTiff, we know the file has a segment structure, and reading any bounding box will ultimately be expressed as reading segments.

A single GeoTiff file segment may intersect multiple target windows. Knowing all of the target windows allows us to read the segment once and partially populate each target window until it is complete.

The same logic extends to GeoTrellis Avro layers, which are tiled.
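The grouping behind this optimization can be sketched without any I/O: invert the mapping from windows to segments so that each segment is listed once together with every window it feeds. The sketch assumes square segments of `segSize` pixels and non-empty windows with non-negative coordinates; `Rect`, `segmentsFor` and `readPlan` are hypothetical names.

```scala
object SegmentRead {
  /** Pixel rectangle; max bounds are exclusive. */
  final case class Rect(colMin: Int, rowMin: Int, colMax: Int, rowMax: Int)

  /** Ids (segCol, segRow) of the square segments that a window touches. */
  def segmentsFor(w: Rect, segSize: Int): Seq[(Int, Int)] =
    for {
      sr <- (w.rowMin / segSize) to ((w.rowMax - 1) / segSize)
      sc <- (w.colMin / segSize) to ((w.colMax - 1) / segSize)
    } yield (sc, sr)

  /** Invert window -> segments into segment -> windows, so that each
    * segment needs to be fetched once and can be handed to all its windows. */
  def readPlan(windows: Seq[Rect], segSize: Int): Map[(Int, Int), Seq[Rect]] =
    windows
      .flatMap(w => segmentsFor(w, segSize).map(seg => seg -> w))
      .groupBy(_._1)
      .map { case (seg, pairs) => seg -> pairs.map(_._2) }
}
```

With 256-pixel segments, the overlapping windows (0, 0, 300, 300) and (200, 200, 400, 400) each touch the same four segments, so the plan contains four segment reads instead of the eight a window-by-window reader would issue.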

Design Questions

Should read operations group windows, or should there be a pre-grouped structure that represents a RasterRef and a set of windows?
Per-row construction is possibly more compatible with the DataFrame view of the world.

Transform

Once the RasterRefs have been read into RDD[(SpatialKey, MultibandTile)] or RDD[(SpatialKey, BufferedTile[MultibandTile])] we should be able to apply standard GeoTrellis operations.

Design Question

Is MultibandTile actually good here? What if we joined two datasets of differing resolution?

echeipesh commented 6 years ago

My guess is that it's going to be incredibly difficult to get this API even close to the right shape from the get-go. Instead, it makes sense to tackle this feature one use case at a time and iterate on the emerging API as we add complexity. One stumbling block for this approach in the past has been the concern with API stability in the master branch.

I would advocate containing API stability concerns in the release branches (2.0 and 2.1) and merging iterative PRs more aggressively to either the master or develop branch.

metasim commented 6 years ago

@echeipesh is this a use case for the geotrellis-sandbox?

metasim commented 6 years ago

A capability I'd like is the ability to have different backing stores, so that RasterFrames can implement a VirtualLayerMosaic trait and benefit from things like a tile server built for it, assuming such a requirement wouldn't impose undue complexity if known from the start.

Keep in mind that going from a DataFrame back to RDD (TileLayerRDD and friends are already supported) is not a problem.

locationtech / geotrellis