Ideas on generalization of spatial package backends and file sources using GDAL (terra, sf, stars, etc.)

brownag commented 4 months ago

I wanted to throw up some ideas for discussion, might be a bit rambling for a single issue. Happy to break off any particular items as new issues or address in specific PRs; I will submit some draft PRs once I have fleshed these ideas out. I say "we" a lot in here but ultimately I am just one interested opinion and welcome any thoughts or alternatives.

The current target storage format functions defined are file-format centric. This is great, because GDAL is the library behind the scenes for common interfaces to a variety of different file formats. GDAL is used in several R spatial packages notably: sf, terra, and stars. I think this project should abstract out the functionality for GDAL data source paths and provide support for multiple R package/object type interfaces in the result the user sees.

In my opinion, {geotargets} should provide default behavior based on type of spatial data, i.e. vector geometry vs. raster--this is so the user doesn't have to think too much about the formats in their target store, just that they are able to roundtrip an R object equivalent to what they started with. If they care about the format, they should have the ability to choose.

I'd like to make (or suggest others make) a couple PRs to implement: 1) Spatial backend options to allow, for example: GeoTIFF format object with {stars} or Shapefile format with {sf} 2) Generalization of the "multiple file target compression" GDAL /vsizip/ approach to all backends and formats that support it

These should provide some room for discussion about specifics how the group wants to abstract or break out functionality.

Spatial backends based on GDAL

I imagine some users don't care so much what file format their target store contains, but likely will care more about the object types that are returned and the associated packages. The object type matters because of chosen dependencies and preferred workflows of the user. The file type may matter especially when it comes time to read targets back in, in part or in full, when they start taking up a lot of disk space, or some step in the process requires a specific format.

We may not want to require users to load both {sf} and {terra}, for example
- Package usage gated by requireNamespace() and having all of these types of packages that produce the user-facing object in Suggests seems like a good strategy. The alternative would be to say, pick {terra} for use internally and then provide conversion methods for compatibility with other ({sf}/{stars}) objects as input/output.
- I personally am a big fan of {terra}, but still use {sf} for quite a few things. {terra} is great in that it can do both vector and raster data, but there are many R spatial users and a much broader R ecosystem built around {sf}. I think users should be able to avoid one or the other, or interchange as needed, in their workflows if they need to be able to.

Specific result types (e.g. sf data.frame, or lazy tbl, vs SpatVector/SpatVectorProxy) would be customize with options set for the whole pipeline, for a target factory, or in wrapper functions.

For example:

In addition to a tar_geotiff() with multiple options set we could have functions liketar_geotiff_stars() and tar_geotiff_terra(). More generic functions would be possible if we abstract out the file type for all GDAL drivers, you might have tar_vector_sf(filetype="parquet") or tar_vector(filetype="ESRI Shapefile", package="terra")
Target factories and formats could utilize default arguments, possibly customized based on selected filetype; they might read a targets option or environment variable, or be settable through a function {geotargets} would offer.
- Some formats have specific limitations. For example I think you would "always" need to use /vsizip/ or similar compressed file option if you need your target to be stored as a Shapefile (multiple files). So, /vsizip/ would be default in a tar_shapefile_*() helper method. Perhaps such a function would be better named tar_shapefile_zip() to indicate that it only works with target names ending in a ".zip" suffix (which is something users may prefer to avoid)

Generalization of compression for spatial targets with GDAL

The /vsizip/ GDAL virtual file system functionality used in format_shapefile() is an example of something that can be generalized further with a focus on generic GDAL data source paths. I think the idea of being able to compress files that are in the target store (and keep them compressed) is attractive for spatial data which can be quite large--even if targets are not comprised of multiple files.

Since GDAL can read from the compressed target store efficiently, you get the benefit of less file size footprint while also being able to read the file without fully extracting it.
- Also should consider some of the other archive file formats/virtual file system types, and providing interfaces in R to produce them e.g. /vsigzip/ or /vsitar/ analogs to /vsizip/ + utils::zip().
Even without creating specific compressed archive files, there should be robust tools available for controlling GDAL file compression options, supported by many drivers, that are used to write target objects
The ZIP approach is useful for GeoTIFF files where category information is stored in the .tif.aux.xml sidecar file. Convenience methods for terra SpatRaster objects could automatically store a target as a ZIP file (and give warnings about target naming) if the input SpatRaster is categorical and output format is GeoTIFF.

brownag commented 4 months ago

Regarding Item 2 (/vsizip/), it should be possible to drop the .zip extension from file and target name. I was not aware of the alternate syntax!

From https://gdal.org/user/virtual_file_systems.html#vsizip-zip-archives:

Starting with GDAL 2.2, an alternate syntax is available so as to enable chaining and not being dependent on .zip extension, e.g.: /vsizip/{/path/to/the/archive}/path/inside/the/zip/file. Note that /path/to/the/archive may also itself use this alternate syntax.

Aariq commented 4 months ago

These are great ideas and I totally agree that the ideal situation would be one in which users can provide format = tar_raster or format = tar_vector and geotargets takes care of figuring out if the data is coming from terra, sf, stars, etc. and has some defaults for how targets are stored. I think the tricky part, which I don't quite understand if you have a plan for, is how those targets should be read back in and "unmarshalled". How can we know if the current target assumes an upstream target is a terra SpatVector or a sf MULTIPOLYGON? I would think geotargets would have to be as specific as providing tar_<vector/raster>_<package name> (e.g. tar_vector_terra, tar_vector_sf) as formats.

mdsumner commented 4 months ago

excellent descript @brownag

note there is also {gdalraster} now with already support for SOZip creation and file management, and nascent 'gdalvector' support:

https://usdaforestservice.github.io/gdalraster/reference/addFilesInZip.html

https://usdaforestservice.github.io/gdalraster/articles/gdalvector-draft.html

gdalraster has become very richly featured very quickly, and could be the GDAL API that's otherwise entirely missing from R atm.

brownag commented 4 months ago

I think the tricky part, which I don't quite understand if you have a plan for, is how those targets should be read back in and "unmarshalled". How can we know if the current target assumes an upstream target is a terra SpatVector or a sf MULTIPOLYGON? I would think geotargets would have to be as specific as providing tar<vector/raster> (e.g. tar_vector_terra, tar_vector_sf) as formats.

I lean towards this conclusion also-- so that the choice is explicit and not changeable through magical options or settings. How these operations would be handled on the backend could/should be more generic, but for reproducibility and clarity it is probably best to give users options that require explicit choices. This might mean many combinations of thin wrapper methods around the core functions, but I don't think that is inherently bad as long as there is some overall order to how they are named.

excellent descript @brownag

note there is also {gdalraster} now with already support for SOZip creation and file management, and nascent 'gdalvector' support:

https://usdaforestservice.github.io/gdalraster/reference/addFilesInZip.html

https://usdaforestservice.github.io/gdalraster/articles/gdalvector-draft.html

gdalraster has become very richly featured very quickly, and could be the GDAL API that's otherwise entirely missing from R atm.

Sweet! I have seen {gdalraster} and watched some of the (rapid) progress on that with interest... but I don't think I was aware of the plans to provide bindings for the OGR vector API! I have used your vapour package for some of my generic/vector GDAL needs that go beyond terra/sf

Something truly generic, mirroring the GDAL API/"close to the GDAL metal" would allow for all sorts of capabilities and customization--perhaps {gdalraster} would be a good choice for an imported package doing the core backend work for plumbing to the various user-facing types/formats.

njtierney commented 4 months ago

Thanks for this @brownag !

Following on from @Aariq 's #7 - I quite like the idea of tar_{pkg}_<filetype> convention. I think that a lot of/most users would know what package they are reading/creating things with, so it should hopefully facilitate discovery in that way?

I'm still learning about a lot of spatial things, so there is a bit of this I don't quite understand, but I think that this issue could be split out into multiple components, eventually, as there are a few threads in here.

Overall my preference for syntax would be something like:

tar_terra_raster(
  new_raster
  raster_creation_function(args)
)

But overall this would work the same as:

tar_target(
  new_raster
  raster_creation_function(args),
  format = "format_terra_raster"
)

Or something?

with tar_{pkg}_<filetype> we can expose a filetype argument to give users control, e.g.,:

tar_terra_shapefile(
  my_shape,
  create_shapefile(args),
  filetype = "parquet"
)

But then I wonder if

tar_parquet_shapefile(
  my_shape,
  create_shapefile(args)
)

Would be better?

Naming things is hard. But I think that it is worthwhile thinking about the API design - once we have ideas on how we want the user to interact with the package my experience is that it is usually easier to write the code.

Aariq commented 4 months ago

Naming things is hard. But I think that it is worthwhile thinking about the API design - once we have ideas on how we want the user to interact with the package my experience is that it is usually easier to write the code.

I agree we should try to think through this and make some decisions before getting too far. Super helpful discussion here.

Aariq commented 3 months ago

I think everything in here is either in the package already, a PR, or a separate issue. Thanks for the contributions @brownag!

njtierney / geotargets

Ideas on generalization of spatial package backends and file sources using GDAL (terra, sf, stars, etc.) #4

Spatial backends based on GDAL

Generalization of compression for spatial targets with GDAL