njtierney / geotargets

Targets extensions for geospatial data
https://njtierney.github.io/geotargets/
Other
51 stars 4 forks source link

Some thoughts on specifying tile size #83

Closed njtierney closed 5 days ago

njtierney commented 2 months ago

It seems there could be 3 practical ways to specify tile size

  1. pixel/dimension (e.g., 2x2, 3x3)
  2. Block size - use the block size to create better/more efficient tile size (e.g., if you have a 512 block size for a giant raster, it might make best computational sense to stick to tiles that are multiples of block size, since those blocks will need to be read in anyway)
  3. Number of tiles - I want 9 tiles, for example. This might cut against the principles of block size - e.g., if you split a raster into 9 tiles, but the blocksize ends up with 4 tiles, then you might not be getting decent gains in read/write and operations.
Aariq commented 1 month ago

For option 2 we could default to using terra::fileBlocksize() if ncol and nrow aren't supplied to tar_terra_tiles(). We could also replace ncol and nrow with an argument template that gets passed to the y arg of getTileExtents().

library(terra)
#> terra 1.7.78
x <- rast(system.file("ex/elev.tif", package = "terra"))
getTileExtents(x, y = fileBlocksize(x))
#>          xmin     xmax     ymin     ymax
#> [1,] 5.741667 6.533333 49.83333 50.19167
#> [2,] 5.741667 6.533333 49.47500 49.83333
#> [3,] 5.741667 6.533333 49.44167 49.47500

Created on 2024-07-17 with reprex v2.1.0

njtierney commented 1 month ago

I like the idea of a template object being passed along.

I think that if we have default behaviour for not specifying nrow/ncol we would be introducing multiple APIs in the same function tar_terra_tiles:

  1. Specify ncol/nrow
  2. Don't specify ncol/nrow, use blocksize by default

This also introduces additional complexity into the code now to manage these two approaches. It's not a lot of extra complexity, but I would prefer to avoid scope creep early.

As a user I think this could be harder to understand / predict what rast_2x2, and rast_blocksize are doing:

  tar_target(
    elev_file,
    system.file("ex/elev.tif", package="terra"),
    format = "file"
  ),
  tar_terra_rast(
    elev_rast,
    terra::rast(elev_file)
  ),
  tar_terra_tiles(
    name = rast_2x2,
    raster = elev_rast,
    ncol = 2,
    nrow = 2
  ),
  tar_terra_tiles(
    name = rast_blocksize,
    raster = elev_rast
  )

So I think your idea of replacing nrow/ncol is good, we might need to make the user work a little harder and have to supply a template in the form of xmin/xmax/ymin/ymax, but we can provide some tools/examples to make that easier. This also means we can separate out the 3 approaches I listed above into separate functions, that all produce extent. I haven't wrapped up these functions, but here is what a pipeline could look like with template

  tar_target(
    elev_file,
    system.file("ex/elev.tif", package="terra"),
    format = "file"
  ),
  tar_terra_rast(
    elev_rast,
    terra::rast(elev_file)
  ),
  tar_target(tile_2x2, 
    terra::rast(
      terra::ext(elev_rast), 
      ncol = 2, 
      nrow = 2, 
      crs = terra::crs(elev_rast)
      )
  ),
  tar_target(
    ext_2x2,
    terra::getTileExtents(elev_rast, template)
  ), 
  tar_target(
   ext_blocksize,
   getTileExtents(elev_rast, y = fileBlocksize(elev_rast))
  ),
  tar_terra_tiles(
    name = rast_2x2,
    raster = elev_rast,
    template = ext_2x2
  ),
  tar_terra_tiles(
    name = rast_blocked,
    raster = elev_rast,
    template = ext_blocksize
  )
Aariq commented 1 month ago

The thing I don't love about this is that we're no longer taking advantage of an opportunity to introduce some consistency that isn't there with getTileExtents(), so this wouldn't be as much of a value add as with simple ncol and nrow args. template can be a SpatRaster, a SpatVector, or a length one or two numeric vector, which is confusing. Using fileBlocksize() as a default would help so that users don't have to deal with the template arg unless they want to.

I also think it would be ideal if you could just supply the template directly for simplicity of _targets.R. Something like:

tar_terra_tiles(
  name = rast_blocked,
  raster = elev_rast,
  template = fileBlocksize(elev_rast) #this would be default if `template` is omitted or NULL.
 )

or

tar_terra_tiles(
  name = rast_blocked,
  raster = elev_rast,
  template = rast(ext(elev_rast), ncol = 2, nrow = 2, crs = crs(elev_rast))
 )

Or maybe, as I think you're suggesting, there are some helper functions so we could do something like:

tar_terra_tiles(
  name = rast_blocked,
  raster = elev_rast,
  template = \(x) tile_blocksize(x) #or just `template = tile_blocksize`
 )

OR


tar_terra_tiles(
  name = rast_blocked,
  raster = elev_rast,
  template = \(x) tile_n(x, ncol = 2, nrow = 2)
 )
njtierney commented 1 month ago

template can be a SpatRaster, a SpatVector, or a length one or two numeric vector, which is confusing

I think perhaps we have overloaded the term template here - you aren't referring to template arg of tar_terra_tiles()?

I was thinking that template would just be a vector that is the extent: xmin, xmax, ymin, ymax, or perhaps using the tile index information from grout: https://github.com/hypertidy/grout. The reason I like the output from grout is that we have an abstraction around tiling that provides flexibility.

Potentially, you could craft your own extents that aren't tiles exactly, I cannot imagine a solid/common usecase for that, but it might be handy?

Annoyingly, extent as returned by terra is an pointer, which means we might need a custom target function to handle terra::ext, so you can't just supply that as I have in my above example. Another reason that just returning data, as grout does, might be a goer.

library(terra)
#> terra 1.7.78
elev_file <- system.file("ex/elev.tif", package = "terra")
r <- rast(elev_file)
r_ext <- ext(r)
class(r_ext)
#> [1] "SpatExtent"
#> attr(,"package")
#> [1] "terra"
as.numeric(r_ext)
#> Error in as.numeric(r_ext): cannot coerce type 'S4' to vector of type 'double'

Created on 2024-07-19 with reprex v2.1.1

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.4.1 (2024-06-14) #> os macOS Sonoma 14.5 #> system aarch64, darwin20 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Australia/Hobart #> date 2024-07-19 #> pandoc 3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0) #> codetools 0.2-20 2024-03-31 [2] CRAN (R 4.4.1) #> digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0) #> evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0) #> fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0) #> fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.0) #> glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0) #> htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0) #> knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0) #> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0) #> Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.4.0) #> reprex 2.1.1 2024-07-06 [1] CRAN (R 4.4.0) #> rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0) #> rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0) #> rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0) #> terra * 1.7-78 2024-05-22 [1] CRAN (R 4.4.0) #> withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0) #> xfun 0.45 2024-06-16 [1] CRAN (R 4.4.0) #> yaml 2.3.9 2024-07-05 [1] CRAN (R 4.4.0) #> #> [1] /Users/njt/Library/R/arm64/4.4/library #> [2] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

Using fileBlocksize() as a default would help so that users don't have to deal with the template arg unless they want to.

Yes I think I like this as a default for tar_terra_tiles().

I was not imagining an anonymous function, but rather something like:

tar_target(
  rast_blocksize,
  tile_blocksize(x)
),
tar_terra_tiles(
  name = rast_blocked,
  raster = elev_rast,
  template = rast_blocksize # or  tile_blocksize(x) or tile_n(x, ncol = 2, nrow = 2)
 )
njtierney commented 1 month ago

Tagging @mdsumner in here as he wrote grout and I'd be curious to hear his thoughts on this

Aariq commented 1 month ago

template can be a SpatRaster, a SpatVector, or a length one or two numeric vector, which is confusing

I think perhaps we have overloaded the term template here - you aren't referring to template arg of tar_terra_tiles()?

I was thinking that template would just be a vector that is the extent: xmin, xmax, ymin, ymax, or perhaps using the tile index information from grout: https://github.com/hypertidy/grout. The reason I like the output from grout is that we have an abstraction around tiling that provides flexibility.

I was thinking that the template argument of tar_terra_tiles() would just get passed to the y argument of getTileExtents(), but it doesn't have to be that way. I want to avoid having to craft my own object for whatever gets used as a template—I usually will just want to split a raster into a reasonable number of tiles and don't care what shape they are or if they're all even.

Annoyingly, extent as returned by terra is an pointer, which means we might need a custom target function to handle terra::ext, so you can't just supply that as I have in my above example. Another reason that just returning data, as grout does, might be a goer.

Yeah, this is exactly why I ended up writing create_tile_exts() to return a list of numeric vectors. So instead of being a basically internal function, it could be repurposed as a helper function for the template arg of tar_terra_tiles()

I was not imagining an anonymous function, but rather something like:

tar_target(
  rast_blocksize,
  tile_blocksize(x)
),
tar_terra_tiles(
  name = rast_blocked,
  raster = elev_rast,
  template = rast_blocksize # or  tile_blocksize(x) or tile_n(x, ncol = 2, nrow = 2)
 )

This would be fine (although I think x above would have to be elev_rast for this to work), but again, I'd like to find a solution where I really don't have to think much about constructing the template object unless I want to. That includes (ideally, from my perspective) not having to create a separate target for it manually.

mdsumner commented 1 month ago

I think specifying tile size is a good default, check that it's a valid value - a reasonable fallback is to treat each row as a tile and that's probably what terra does (it's what raster did, and is what GDAL does).

It doesn't make sense to me to have the user specify a tiling dimension, because the opportunity for optimization is about the size of each tile and a multiplier of that is very simple.

Also the terra interface for it has some unexpected behaviours for centring the scheme, and I think that logic belongs elsewhere ( like terra itself).

I personally need the logic for the schemes completely independently of any package or library, it's extremely powerful (see how much the xarray community cares about "chunking" as one example, but even determining the number and size of tiles and where they go - might be in memory to S3 storage for example)

Aariq commented 5 days ago

Going to close this as most of this advice was implemented in #84 and #90