njtierney / geotargets

Targets extensions for geospatial data
https://njtierney.github.io/geotargets/
Other
49 stars 4 forks source link

Add support for terra SpatVectorProxy, and `format="file"` for SpatRaster #42

Open brownag opened 3 months ago

brownag commented 3 months ago

The {terra} SpatVectorProxy allows you to create a "lazy" reference to a vector dataset with terra::vect(..., proxy=TRUE) that you can query with terra::query() rather than loading all attributes and geometry into memory. This is very helpful and can be much more efficient when working with portions of large data. The SpatVector is always in memory, SpatVectorProxy never in memory, and SpatRaster is in memory if it is sufficiently small, otherwise it automatically behaves as if it were a "SpatRasterProxy" to the source file.

Currently, tar_terra_vect() cannot handle SpatVectorProxy because there is no SpatVectorProxy writeVector() method:

library(targets)

tar_script({
    list(
        geotargets::tar_terra_vect(test_terra_proxy,
                                   terra::vect(system.file("ex", "lux.shp", package = "terra"), proxy = TRUE),
                                   filetype = "Parquet")
    )
})

tar_make()
#> Loading required namespace: terra
#> ▶ dispatched target test_terra_proxy
#> ✖ errored target test_terra_proxy
#> ✖ errored pipeline [0.121 seconds]
#> Error:
#> ! Error running targets::tar_make()
#> Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
#> Debugging guide: https://books.ropensci.org/targets/debugging.html
#> How to ask for help: https://books.ropensci.org/targets/help.html
#> Last error message:
#>     _store_ unable to find an inherited method for function ‘writeVector’ for signature ‘"SpatVectorProxy", "character"’
#> Last error traceback:
#>     No traceback available.
# ...

A philosophical question is whether creating a target for a SpatVectorProxy should copy the full source data to the target store, as we do for vector objects in memory, OR create a format="file" target for the data source returned by terra::sources().

For a proxy object, I think I might often prefer the latter option. On one hand the former might be more reproducible in general as the source data get copied, but essentially we have this as the default SpatVector and SpatRaster approach already.

I often work with some fairly large file-based databases using a SpatVectorProxy or large SpatRaster initially and materializing only small portions relevant to specific areas later with query() or crop() or similar. Usually I would be fine to have targets just track the state of the source file, rather than a full copy of the data, as those things are not changing much, and often can be downloaded through standard methods (and the download could be a preceding target in the pipeline, prior to creating the SpatVectorProxy/SpatRaster)

Perhaps the methods we have implemented should have an option to utilize an existing source and a format="file" approach. This would be the default behavior for SpatVectorProxy, and default could be based on terra::inMemory() for SpatRaster.

I see a few problems with the above:

mdsumner commented 3 months ago

Do you know how to materialize a proxy vect? it doesn't seem to have any behaviours beyond crs(), ext(), dim() and plot() - which plots the extent only. I'd expect vect() at least to work but it does not.

I'm super interested in the concerns here and the topic of GDAL generally.

brownag commented 3 months ago

Do you know how to materialize a proxy vect?

I think you want terra::query() for that i.e.

library(terra)
#> terra 1.7.74

x <- vect(system.file("ex", "lux.shp", package = "terra"), proxy=TRUE)
x
#>  class       : SpatVectorProxy
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.826232, 6.16085, 49.94611, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : lux.shp
#>  layer       : lux 
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1 NAME_1  ID_2 NAME_2  AREA   POP
#>  type        : <num>  <chr> <num>  <chr> <num> <int>

y <- query(x)
y
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : lux.shp
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

y2 <- query(x, sql="SELECT * FROM lux LIMIT 5")
y2
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 5, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.315773, 49.69933, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : lux.shp
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

I agree it perhaps would make sense to be able to do these same operations via vect()

mdsumner commented 3 months ago

I guess rast() drops reference to actual data so perhaps vect is the wrong expectation