Open echeipesh opened 6 years ago
My guess is that it's going to be incredibly difficult to get this API even close to the right shape from the get-go. Instead it makes sense to tackle this feature one use case at a time and iterate on the emerging API as we add complexity. One stumbling block for this approach in the past has been the concern with API stability in the `master` branch.
I would advocate containing API stability concerns in the release branches (`2.0` and `2.1`) and merging iterative PRs more aggressively to either the `master` or `develop` branch.
@echeipesh is this a use case for the `geotrellis-sandbox`?
A capability I'd like is the ability to have different backing stores, so that RasterFrames can implement a `VirtualLayerMosaic` trait and benefit from things like a tile server built for it. This assumes such a requirement wouldn't levy undue complexity if known from the start.
Keep in mind that going from a DataFrame back to an RDD (`TileLayerRDD` and friends are already supported) is not a problem.
To combine multiple raster sources into a single output, their pixel values must be co-registered, that is, they share:
Typically this requirement is handled by an Extract, Transform, Load (ETL) step, which transforms the input rasters to a shared layout that defines the above parameters. While this approach is valid and common, it makes too many arbitrary decisions before the reading step starts.
The single ETL process should be broken up into several stages that allow incremental refinement and modification of the process based on metadata.
A better approach is to query or read only the raster metadata and associate it with the source of raster cells in a `RasterRef` (working name). This would allow splitting, filtering, and joining on the metadata records, reading the cells only when the shape and content of the final output are known. The intermediate data structure represents something like a virtual mosaic, like a GDAL VRT, that can be further refined in Spark memory before being reified.
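A minimal sketch of the idea, in plain Scala without Spark or GeoTrellis on the classpath. Every name here (`Extent`, `RasterMeta`, `RasterRef`, `RasterRefDemo`) is a hypothetical stand-in and not a proposed API; the point is only that metadata can be filtered while the cell read stays deferred.

```scala
// Hypothetical stand-ins for illustration; none of this is GeoTrellis API.
final case class Extent(xmin: Double, ymin: Double, xmax: Double, ymax: Double) {
  def intersects(o: Extent): Boolean =
    xmin <= o.xmax && o.xmin <= xmax && ymin <= o.ymax && o.ymin <= ymax
}

// Metadata that is cheap to obtain (file header or catalog query),
// plus a user-defined metadata value of type M.
final case class RasterMeta[M](crs: String, extent: Extent, cols: Int, rows: Int, user: M)

// A reference pairs the metadata with a deferred read of the cells.
final case class RasterRef[M](uri: String, meta: RasterMeta[M], read: () => Array[Int])

object RasterRefDemo {
  var reads = 0 // counts how many times cells were actually read

  // Fake reader standing in for real I/O.
  private def fakeRead(cols: Int, rows: Int): () => Array[Int] =
    () => { reads += 1; Array.fill(cols * rows)(0) }

  // Here the user metadata M is an acquisition year.
  val refs: Seq[RasterRef[Int]] = Seq(
    RasterRef("a.tif", RasterMeta("EPSG:4326", Extent(0, 0, 10, 10), 100, 100, 2017), fakeRead(100, 100)),
    RasterRef("b.tif", RasterMeta("EPSG:4326", Extent(20, 20, 30, 30), 100, 100, 2018), fakeRead(100, 100))
  )

  // Filtering touches metadata only; no cells are read yet.
  val areaOfInterest = Extent(5, 5, 15, 15)
  val selected: Seq[RasterRef[Int]] = refs.filter(_.meta.extent.intersects(areaOfInterest))
  val readsBeforeReify: Int = reads

  // Reify: only the surviving references are read.
  val tiles: Seq[Array[Int]] = selected.map(_.read())
}
```

In a Spark setting the `Seq[RasterRef[M]]` would presumably become an `RDD` or `Dataset` of references, with the final `map` doing the reads on the executors.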
The expected stages to support this feature are:
Query
Identify potential rasters and optionally read their metadata if not given in query response.
Design Question
`RasterRef` to have user defined metadata values. Is it a type parameter `M` or something else?
Filter
Remove or modify the raster references based on metadata.
Split
Partition references to larger rasters into sets of possibly overlapping windows. In this step we may end up with records that reference different regions of the same file or table.
Windows may overlap if it's desirable to read a buffer of pixels around a region, allowing that region to be correctly resampled or reprojected.
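The split step can be sketched as plain grid arithmetic. This is only an illustration under assumptions: `PixelWindow` and `SplitSketch.split` are hypothetical names, windows use inclusive pixel bounds, and the buffer is simply clamped at the raster edge.

```scala
// Hypothetical pixel window with inclusive bounds, in the source raster's pixel space.
final case class PixelWindow(colMin: Int, rowMin: Int, colMax: Int, rowMax: Int) {
  def width: Int = colMax - colMin + 1
  def height: Int = rowMax - rowMin + 1
}

object SplitSketch {
  /** Cut a cols x rows raster into windows of at most `size` x `size` pixels.
    * Each window is grown by `buffer` pixels on every side and clamped to the
    * raster, so neighbouring windows overlap when buffer > 0.
    */
  def split(cols: Int, rows: Int, size: Int, buffer: Int = 0): Seq[PixelWindow] =
    for {
      row <- 0 until rows by size
      col <- 0 until cols by size
    } yield PixelWindow(
      math.max(col - buffer, 0),
      math.max(row - buffer, 0),
      math.min(col + size - 1 + buffer, cols - 1),
      math.min(row + size - 1 + buffer, rows - 1)
    )
}
```

For a 512 x 512 raster, `SplitSketch.split(512, 512, 256, buffer = 2)` yields four windows; the first spans pixels 0 to 257 in both dimensions, so it shares a 4 pixel wide strip with its right-hand neighbour.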
Join
Once the raster refs have been split into windows according to some `LayoutDefinition`, they can be keyed and joined on those keys.
The intent of the join is either to assemble different rasters into a `MultibandTile` or to select a subset of the sources based on metadata.
Design Questions
`LayoutDefinition` or is there some class that can enforce this per result?
Read
Reading the raster cells described by each window of the `RasterRef` into a `MultibandTile`.
Reading the raster cells has room for optimization. In the common case of GeoTiff we know the file has a segment structure, and reading any bounding box will ultimately be expressed as reading segments.
A single GeoTiff file segment may intersect multiple target windows. Knowing all of the target windows allows us to read the segment once and partially populate each target window until it's complete.
The same logic extends to GeoTrellis Avro layers, which are tiled.
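A rough sketch of that read strategy, assuming single-band integer cells stored row-major per segment. `Window`, `SegmentReadSketch.readWindows` and the `readSegment` callback are hypothetical names for illustration, not the GeoTiff reader API.

```scala
// Hypothetical inclusive pixel window, used for both segments and targets.
final case class Window(colMin: Int, rowMin: Int, colMax: Int, rowMax: Int) {
  def width: Int = colMax - colMin + 1
  def height: Int = rowMax - rowMin + 1
  def intersection(o: Window): Option[Window] = {
    val w = Window(
      math.max(colMin, o.colMin), math.max(rowMin, o.rowMin),
      math.min(colMax, o.colMax), math.min(rowMax, o.rowMax))
    if (w.colMin <= w.colMax && w.rowMin <= w.rowMax) Some(w) else None
  }
}

object SegmentReadSketch {
  /** Fill every target window, reading each needed segment exactly once.
    * `readSegment` returns the row-major cells of the given segment.
    */
  def readWindows(
    segments: Seq[Window],
    targets: Seq[Window],
    readSegment: Window => Array[Int]
  ): Map[Window, Array[Int]] = {
    // One output buffer per target window.
    val out = targets.map(t => (t, new Array[Int](t.width * t.height))).toMap
    for (seg <- segments) {
      // All targets this segment contributes to, with the overlapping region.
      val hits = targets.flatMap(t => t.intersection(seg).map(overlap => (t, overlap)))
      if (hits.nonEmpty) {
        val cells = readSegment(seg) // single read, shared by all hits
        for {
          (t, overlap) <- hits
          row <- overlap.rowMin to overlap.rowMax
          col <- overlap.colMin to overlap.colMax
        } out(t)((row - t.rowMin) * t.width + (col - t.colMin)) =
            cells((row - seg.rowMin) * seg.width + (col - seg.colMin))
      }
    }
    out
  }
}
```

Segments that intersect no target window are never read, and a segment shared by several targets is read once rather than once per target.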
Design Questions
`RasterRef` and a set of windows?
Transform
Once the `RasterRefs` have been read into `RDD[(SpatialKey, MultibandTile)]` or `RDD[(SpatialKey, BufferedTile[MultibandTile])]` we should be able to apply standard GeoTrellis operations.
Design Question
Is `MultibandTile` actually good here? What if we joined two datasets of differing resolution?
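To make the last question concrete, here is a small sketch of a keyed join over two sources of differing resolution. It uses plain collections instead of RDDs, and `SpatialKey` here is a local stand-in for the GeoTrellis class of the same name; `Band` and `JoinSketch` are hypothetical. A `MultibandTile` exposes a single `cols`/`rows` pair for all of its bands, so the joined pair below could not be packed into one without resampling first.

```scala
// Local stand-in for the GeoTrellis SpatialKey, for illustration only.
final case class SpatialKey(col: Int, row: Int)

// Hypothetical band descriptor: where it came from and its pixel dimensions.
final case class Band(source: String, cols: Int, rows: Int)

object JoinSketch {
  /** Inner join on the layout key, analogous to joining two pair RDDs. */
  def join(
    left: Seq[(SpatialKey, Band)],
    right: Seq[(SpatialKey, Band)]
  ): Map[SpatialKey, (Band, Band)] = {
    val rightByKey = right.toMap
    left.flatMap { case (key, band) =>
      rightByKey.get(key).map(other => (key, (band, other)))
    }.toMap
  }

  /** True when both bands could share one cols/rows pair as-is. */
  def sameDimensions(pair: (Band, Band)): Boolean =
    pair._1.cols == pair._2.cols && pair._1.rows == pair._2.rows

  // Same layout key and ground footprint, but 10 m versus 20 m pixels.
  val tenMeter = Seq((SpatialKey(0, 0), Band("red-10m", 256, 256)))
  val twentyMeter = Seq(
    (SpatialKey(0, 0), Band("swir-20m", 128, 128)),
    (SpatialKey(1, 0), Band("swir-20m", 128, 128))
  )

  val joined: Map[SpatialKey, (Band, Band)] = join(tenMeter, twentyMeter)
}
```

The join itself works on keys alone; the mismatch only surfaces when the bands are assembled, which suggests either resampling during the read or a result type that tolerates per-band dimensions.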