locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0

Implement buffered tiles #441

Closed: metasim closed this issue 2 years ago

jpolchlo commented 4 years ago

This issue has been sitting for some time and remains an important milestone. Rather than wait for the work to be done for us, we'd like to propose a work plan for getting it done. This is largely based on our experience adding an experimental tensor type to a branch of RF. A buffered version of that tensor was also provided, though it remained unclear whether the tensor type's additional dimension was particularly useful. Still, the experience of adding a buffered type was instructive, and we believe that introducing one will enable some valuable operations.

The main assumption behind this work is that we need a new BufferedTile type and a corresponding UDT. The BufferedTile provided by GeoTrellis is nearly worthless, so we should create a more featureful replacement. This can be implemented directly in RF to start, and back-ported to GT if it's deemed worthy. This base class will provide the means to work with both Tiles and BufferedTiles in local map algebra operations, and convolution methods should be provided for focal operations.
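As a sketch of what a more featureful replacement could look like, assuming a symmetric buffer (all names here are illustrative, not an existing API):

```scala
import geotrellis.raster._

// Illustrative sketch of a replacement BufferedTile, assuming a symmetric
// buffer of `bufferSize` cells on every side. None of these names exist yet.
case class BufferedTile(tile: Tile, bufferSize: Int) {
  // The "core" region of the tile, with the buffer cells cropped away.
  def core: Tile =
    tile.crop(bufferSize, bufferSize,
      tile.cols - bufferSize - 1, tile.rows - bufferSize - 1)

  // Local (cell-wise) operations can run over the full grid, buffer included,
  // which is what lets BufferedTiles participate in local map algebra.
  def mapTile(f: Tile => Tile): BufferedTile = copy(tile = f(tile))

  // A focal operation with radius r would consume part of the buffer,
  // leaving bufferSize - r cells of buffer on the result (not shown).
}
```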

Expression nodes will need to be modified to allow for the two types of tiles to interact easily. For instance, BinaryLocalRasterOp can be provided with more overloads to its op method, and the eval functions can be extended to do the work with the new type in whichever combination is required.
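For instance, the added overloads might look something like the following sketch; `op(Tile, Tile)` is the existing contract, the rest is hypothetical and assumes the proposed BufferedTile type (stubbed here for self-containment):

```scala
import geotrellis.raster._

// Minimal stand-in for the proposed type (illustration only).
case class BufferedTile(tile: Tile, bufferSize: Int)

// Sketch of how a binary local op could be overloaded for mixed operands.
trait BufferedLocalOp {
  protected def op(left: Tile, right: Tile): Tile

  // Both buffered: operate over the full grids and keep the smaller buffer.
  protected def op(left: BufferedTile, right: BufferedTile): BufferedTile =
    BufferedTile(op(left.tile, right.tile),
      math.min(left.bufferSize, right.bufferSize))

  // Mixed: the unbuffered operand cannot contribute buffer cells, so the
  // result degrades to a plain Tile over the buffered operand's core region.
  private def coreOf(bt: BufferedTile): Tile =
    bt.tile.crop(bt.bufferSize, bt.bufferSize,
      bt.tile.cols - bt.bufferSize - 1, bt.tile.rows - bt.bufferSize - 1)

  protected def op(left: BufferedTile, right: Tile): Tile =
    op(coreOf(left), right)
}
```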

We'll also need to extend readers to produce buffered tiles.

Successful implementation of this feature should be accompanied by some basic focal operations.

metasim commented 4 years ago

@jpolchlo This is fantastic. Thanks for getting the ball rolling on this. It's been on my todo for way too long.

I would propose that we seriously consider treating all tiles as "buffered tiles", with the default buffer being zero. This would pretty much allow all operations/Expressions to continue working unchanged. However, I'm assuming a fairly naive representation of a buffered tile, where there's an equal buffer amount around the whole tile, and we're only talking about adding a single field to the UDT. The GeoTrellis BufferedTile uses a GridBounds[Int] to describe the buffering. I'm not sure I understand its semantics, or the implications of it being asymmetric.

How does this line of thinking fall apart?
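For reference, a single symmetric buffer value can be expanded into the GridBounds[Int] form, read as the inclusive bounds of the core region within the padded grid (a sketch; the mapping itself is an assumption about GeoTrellis's semantics):

```scala
import geotrellis.raster.GridBounds

// Sketch: expand a symmetric buffer into GridBounds form, where the bounds
// mark the core region inside the padded grid (inclusive indices).
def symmetricBounds(coreCols: Int, coreRows: Int, buffer: Int): GridBounds[Int] =
  GridBounds(buffer, buffer, buffer + coreCols - 1, buffer + coreRows - 1)

// An asymmetric GridBounds (say, a wider buffer on one edge) has no
// single-value equivalent; that's the expressiveness being traded away.
```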

Another option is to use a structural typing approach like we do for "ProjectedRasters", but I feel this is much more an implementation detail, and deserves to be hidden behind the UDT curtain.

The initial way to generate the buffered tiles would be to add a parameter to spark.read.raster. Then I suppose we'd want to look at rf_reproject and other operations where you might want to go from unbuffered to buffered whenever interpolation is involved.
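Concretely, the reader change might look something like this; spark.read.raster is the existing entry point, but the buffer option is hypothetical:

```scala
import org.locationtech.rasterframes._

// Assumes an active SparkSession `spark` with RasterFrames initialized.
// "buffer_size" is a proposed option name, not an existing parameter.
val df = spark.read.raster
  .option("buffer_size", "3")  // hypothetical; illustration only
  .load("https://example.com/scene.tif")
```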

Would you mind linking your {Buffered}TensorUDT implementations here? Pointing out salient sections would be a bonus.

metasim commented 4 years ago

BTW, I'm hoping to get GT 3.3.0 merged in soon, if that's relevant.

metasim commented 4 years ago

A TileUDT is logically an aggregate of TileDataContext and Cells:

object TileUDT {
    ...
    override val schema: StructType = StructType(Seq(
      StructField("cell_context", schemaOf[TileDataContext], true),
      StructField("cell_data", schemaOf[Cells], false)
    ))
    ...
}

case class TileDataContext(cellType: CellType, dimensions: Dimensions[Int])
case class Cells(data: Either[Array[Byte], RasterRef])

An approach might be to change to:

case class TileDataContext(cellType: CellType, dimensions: Dimensions[Int], bufferSize: Short)

(again, assuming a single value could capture buffer definition)

and then work through the downstream effects on Cells and InternalRowTile.

Then create something like rf_buffer[N: Numeric](tile: Tile, size: Short, fillValue: N) to enable basic unit tests.
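A minimal sketch of that rf_buffer, written as a plain function rather than a registered Catalyst Expression (the names and semantics are the proposal's, not an existing API; BufferedTile is stubbed here):

```scala
import geotrellis.raster._

// Stand-in for the proposed type (illustration only).
case class BufferedTile(tile: Tile, bufferSize: Int)

// Pad `tile` with `size` cells of `fillValue` on every side. The real
// version would be a Catalyst Expression; this is just enough for unit tests.
def rfBuffer(tile: Tile, size: Int, fillValue: Double): BufferedTile = {
  val out = ArrayTile.alloc(tile.cellType, tile.cols + 2 * size, tile.rows + 2 * size)
  // Fill the whole padded grid, then copy the original cells into the core.
  for (r <- 0 until out.rows; c <- 0 until out.cols)
    out.setDouble(c, r, fillValue)
  for (r <- 0 until tile.rows; c <- 0 until tile.cols)
    out.setDouble(c + size, r + size, tile.getDouble(c, r))
  BufferedTile(out, size)
}
```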

metasim commented 4 years ago

@echeipesh Interested in your advice/wisdom on implementation paths, if you have any.

echeipesh commented 4 years ago

I don't feel like I grok all the implications well enough to have a hard opinion.

At first sight, wrapping things in TileUDT makes sense, because my base expectation is that buffered tiles are just tiles and should be usable in all contexts in which tiles are. Consequently, you end up with a few options for how to handle them.

My impulse is to work this out in the GeoTrellis Tile hierarchy by adding a BufferTile which extends Tile, and then try to bubble it up to the Spark Expressions.

One thing that I'm thinking about is that having a distinct BufferTileUDT would allow some type checking to happen during the query planning. For instance using rf_focal_max on TileUDT would throw an exception and would work on BufferTileUDT. Similarly the logic of adding Tile and BufferTile could be reflected in the DataFrame types.
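That analysis-time enforcement could hang off checkInputDataTypes, which Catalyst consults during query planning. A sketch of the check itself; TypeCheckResult is real Catalyst API, but the "buffer_tile" type name stands in for the hypothetical BufferTileUDT:

```scala
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.types.DataType

// Sketch of planning-time enforcement: an Expression's checkInputDataTypes
// could reject a plain TileUDT before any tiles are ever evaluated.
def checkFocalInput(dt: DataType): TypeCheckResult = dt match {
  case udt if udt.typeName == "buffer_tile" => // hypothetical UDT type name
    TypeCheckResult.TypeCheckSuccess
  case other =>
    TypeCheckResult.TypeCheckFailure(
      s"rf_focal_max requires a buffered tile, got ${other.typeName}")
}
```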

metasim commented 4 years ago

One thing that I'm thinking about is that having a distinct BufferTileUDT would allow some type checking to happen during the query planning.

Indeed, that would be a benefit. However, is not rf_focal_max valid as long as the kernel (neighborhood, or whatever) doesn't extend beyond the "natural" boundaries of the tile? IOW, there are at least some cases when focal operations can be done on certain tiles.

Another option is to make use of "structural typing", as is done to represent a ProjectedRasterTile. If you have an Extent, CRS, and Tile organized according to the CatalystSerializer schema for the type, then it is interpreted as such and type-checked accordingly. That's what's going on with the conformsTo operations.

Not suggesting we use structural typing over UDTs in this case (I think it might be folly), but it's in the bag of tricks.

echeipesh commented 4 years ago

I think what you're saying is that there are two modes of using rf_focal_max:

  1. Your tiles are from a contiguous but tiled layer; any call to rf_focal_max that exhausts the pre-allocated buffer must error, because you're losing information.

  2. Your tiles are "just tiles, man" and the user for whatever reason is OK with having incomplete focal kernels, accepting the resulting artifacts.

The real danger is that without some kind of hard boundary, the code may transition from case 1 to case 2 without the user realizing it. For instance, the user might perform a buffered read, producing buffered tiles, then combine them with some non-buffered tiles (which must erase the tile buffers), and then call rf_focal_max. What was the user's expectation in that case? I would say it was case 1, but the result will be case 2.

What's worse is that the difference in output is actually quite subtle. You're not going to produce NODATA regions, just slightly different/incorrect cell values that will propagate to downstream calculations. That's a 3 AM laptop-through-a-window moment waiting to happen.