Add support for terra SpatVectorProxy, and `format="file"` for SpatRaster

brownag commented 3 months ago

The {terra} SpatVectorProxy allows you to create a "lazy" reference to a vector dataset with terra::vect(..., proxy=TRUE) that you can query with terra::query() rather than loading all attributes and geometry into memory. This is very helpful and can be much more efficient when working with portions of large data. The SpatVector is always in memory, SpatVectorProxy never in memory, and SpatRaster is in memory if it is sufficiently small, otherwise it automatically behaves as if it were a "SpatRasterProxy" to the source file.

Currently, tar_terra_vect() cannot handle SpatVectorProxy because there is no SpatVectorProxy writeVector() method:

library(targets)

tar_script({
    list(
        geotargets::tar_terra_vect(test_terra_proxy,
                                   terra::vect(system.file("ex", "lux.shp", package = "terra"), proxy = TRUE),
                                   filetype = "Parquet")
    )
})

tar_make()
#> Loading required namespace: terra
#> ▶ dispatched target test_terra_proxy
#> ✖ errored target test_terra_proxy
#> ✖ errored pipeline [0.121 seconds]
#> Error:
#> ! Error running targets::tar_make()
#> Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
#> Debugging guide: https://books.ropensci.org/targets/debugging.html
#> How to ask for help: https://books.ropensci.org/targets/help.html
#> Last error message:
#>     _store_ unable to find an inherited method for function ‘writeVector’ for signature ‘"SpatVectorProxy", "character"’
#> Last error traceback:
#>     No traceback available.
# ...

A philosophical question is whether creating a target for a SpatVectorProxy should copy the full source data to the target store, as we do for vector objects in memory, OR create a format="file" target for the data source returned by terra::sources().

For a proxy object, I think I might often prefer the latter option. On one hand the former might be more reproducible in general as the source data get copied, but essentially we have this as the default SpatVector and SpatRaster approach already.

I often work with some fairly large file-based databases using a SpatVectorProxy or large SpatRaster initially and materializing only small portions relevant to specific areas later with query() or crop() or similar. Usually I would be fine to have targets just track the state of the source file, rather than a full copy of the data, as those things are not changing much, and often can be downloaded through standard methods (and the download could be a preceding target in the pipeline, prior to creating the SpatVectorProxy/SpatRaster)

Perhaps the methods we have implemented should have an option to utilize an existing source and a format="file" approach. This would be the default behavior for SpatVectorProxy, and default could be based on terra::inMemory() for SpatRaster.

I see a few problems with the above:

I suppose that format="file" would only work for source formats that are a file to begin with (e.g. GeoPackage, FGDB, SQLite, DuckDB, Parquet...) but not for true database drivers like PostgreSQL. I think this can be addressed as a different issue for true database sources, where instead of a format="file" you store a checksum for a table or query result from a database source, and possibly? something about the database connection.
{terra} will automatically write temporary files for raster operations that are too big to be done entirely in memory. This means inMemory() could return FALSE, but a target could be created based on a temporary file that will be deleted after the R session cleans up, invalidating downstream targets. In the case a temp file is used, rather than a file path specifically chosen by the user, we may not want to automatically decide for the user whether the "proxy" behavior of SpatRaster should be triggered.

mdsumner commented 3 months ago

Do you know how to materialize a proxy vect? it doesn't seem to have any behaviours beyond crs(), ext(), dim() and plot() - which plots the extent only. I'd expect vect() at least to work but it does not.

I'm super interested in the concerns here and the topic of GDAL generally.

brownag commented 3 months ago

Do you know how to materialize a proxy vect?

I think you want terra::query() for that i.e.

library(terra)
#> terra 1.7.74

x <- vect(system.file("ex", "lux.shp", package = "terra"), proxy=TRUE)
x
#>  class       : SpatVectorProxy
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.826232, 6.16085, 49.94611, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : lux.shp
#>  layer       : lux 
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1 NAME_1  ID_2 NAME_2  AREA   POP
#>  type        : <num>  <chr> <num>  <chr> <num> <int>

y <- query(x)
y
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : lux.shp
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

y2 <- query(x, sql="SELECT * FROM lux LIMIT 5")
y2
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 5, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.315773, 49.69933, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : lux.shp
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

I agree it perhaps would make sense to be able to do these same operations via vect()

mdsumner commented 3 months ago

I guess rast() drops reference to actual data so perhaps vect is the wrong expectation

njtierney / geotargets

Add support for terra SpatVectorProxy, and `format="file"` for SpatRaster #42