njtierney / geotargets

Targets extensions for geospatial data
https://njtierney.github.io/geotargets/

Ideas on generalization of spatial package backends and file sources using GDAL (terra, sf, stars, etc.) #4

Closed by brownag 3 months ago

brownag commented 4 months ago

I wanted to throw out some ideas for discussion; this might be a bit rambling for a single issue. Happy to break off any particular items as new issues or address them in specific PRs; I will submit some draft PRs once I have fleshed these ideas out. I say "we" a lot in here, but ultimately I am just one interested opinion and welcome any thoughts or alternatives.


The target storage format functions currently defined are file-format centric. This is great, because GDAL is the library behind the scenes for common interfaces to a variety of file formats. GDAL is used in several R spatial packages, notably sf, terra, and stars. I think this project should abstract out the handling of GDAL data source paths and support multiple R package/object-type interfaces for the result the user sees.

In my opinion, {geotargets} should provide default behavior based on type of spatial data, i.e. vector geometry vs. raster--this is so the user doesn't have to think too much about the formats in their target store, just that they are able to roundtrip an R object equivalent to what they started with. If they care about the format, they should have the ability to choose.

I'd like to make (or suggest others make) a couple of PRs to implement:

1. Spatial backend options to allow, for example, a GeoTIFF-format target read back with {stars}, or a Shapefile-format target read back with {sf}
2. Generalization of the "multiple file target compression" GDAL /vsizip/ approach to all backends and formats that support it

These should provide some room for discussion about the specifics of how the group wants to abstract or break out functionality.


Spatial backends based on GDAL

I imagine some users don't care much what file format their target store contains, but they likely care more about the object types that are returned and the associated packages. The object type matters because of the user's chosen dependencies and preferred workflows. The file type may matter especially when it comes time to read targets back in, in part or in full, when they start taking up a lot of disk space, or when some step in the process requires a specific format.

Specific result types (e.g. an sf data.frame or lazy tbl, vs. a SpatVector/SpatVectorProxy) would be customized via options set for the whole pipeline, for a target factory, or in wrapper functions.

For example:
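The example that presumably belonged here was not preserved in this copy of the thread; below is a hypothetical sketch of what per-backend customization could look like. All names in it (geotargets_option_set(), tar_geo_vector(), the backend argument, and the read_* helpers) are illustrative assumptions, not existing {geotargets} API:

```r
# Illustrative only: these option and function names are hypothetical,
# sketching how result types might be chosen at different scopes.
library(targets)

# Pipeline-wide default: vector targets are read back in as sf objects
geotargets_option_set(vector_backend = "sf")

list(
  # Uses the pipeline default, so this target returns an sf data.frame
  tar_geo_vector(roads, read_roads()),

  # Per-target override: return a terra SpatVector for this target only
  tar_geo_vector(parcels, read_parcels(), backend = "terra")
)
```

The stored file format could stay the same in both cases; only the object handed back to downstream targets would differ.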


Generalization of compression for spatial targets with GDAL

The /vsizip/ GDAL virtual file system functionality used in format_shapefile() is an example of something that can be generalized further, with a focus on generic GDAL data source paths. I think the ability to compress files in the target store (and keep them compressed) is attractive for spatial data, which can be quite large--even when a target is not composed of multiple files.
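As a concrete illustration of the /vsizip/ idea (the archive paths below are hypothetical), any GDAL-backed reader such as terra can open a data source inside a zip archive directly, without unzipping it first:

```r
library(terra)

# Read a shapefile directly out of a zip archive in the target store;
# GDAL's /vsizip/ virtual file system decompresses on the fly.
v <- vect("/vsizip/_targets/objects/roads.zip/roads.shp")

# The same prefix works for raster sources, e.g. a zipped GeoTIFF:
r <- rast("/vsizip/_targets/objects/elevation.zip/elevation.tif")
```

This is the mechanism a generalized compressed-target store could rely on for every format GDAL supports, not just shapefiles.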

brownag commented 4 months ago

Regarding Item 2 (/vsizip/), it should be possible to drop the .zip extension from the file and target names. I was not aware of the alternate syntax!

From https://gdal.org/user/virtual_file_systems.html#vsizip-zip-archives:

Starting with GDAL 2.2, an alternate syntax is available so as to enable chaining and not being dependent on .zip extension, e.g.: /vsizip/{/path/to/the/archive}/path/inside/the/zip/file. Note that /path/to/the/archive may also itself use this alternate syntax.
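In practice (path is hypothetical), the curly-brace syntax from the quoted docs lets the archive carry any name, which is what would allow dropping .zip from the target name:

```r
library(terra)

# Archive named after the target, with no .zip extension; the braces
# tell GDAL where the archive path ends and the in-archive path begins.
v <- vect("/vsizip/{_targets/objects/roads}/roads.shp")
```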

Aariq commented 4 months ago

These are great ideas and I totally agree that the ideal situation would be one in which users can provide format = tar_raster or format = tar_vector and geotargets takes care of figuring out whether the data is coming from terra, sf, stars, etc. and has some defaults for how targets are stored. I think the tricky part, which I don't quite understand whether you have a plan for, is how those targets should be read back in and "unmarshalled". How can we know if the current target assumes an upstream target is a terra SpatVector or an sf MULTIPOLYGON? I would think geotargets would have to be as specific as providing tar_<vector/raster>_<package name> (e.g. tar_vector_terra, tar_vector_sf) as formats.

mdsumner commented 4 months ago

excellent description @brownag

note there is also {gdalraster} now, which already supports SOZip creation and file management, and has nascent 'gdalvector' support:

https://usdaforestservice.github.io/gdalraster/reference/addFilesInZip.html

https://usdaforestservice.github.io/gdalraster/articles/gdalvector-draft.html

gdalraster has become very richly featured very quickly, and could be the GDAL API that's otherwise entirely missing from R atm.

brownag commented 4 months ago

I think the tricky part, which I don't quite understand whether you have a plan for, is how those targets should be read back in and "unmarshalled". How can we know if the current target assumes an upstream target is a terra SpatVector or an sf MULTIPOLYGON? I would think geotargets would have to be as specific as providing tar_<vector/raster>_<package name> (e.g. tar_vector_terra, tar_vector_sf) as formats.

I lean towards this conclusion also-- so that the choice is explicit and not changeable through magical options or settings. How these operations would be handled on the backend could/should be more generic, but for reproducibility and clarity it is probably best to give users options that require explicit choices. This might mean many combinations of thin wrapper methods around the core functions, but I don't think that is inherently bad as long as there is some overall order to how they are named.
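A minimal sketch of the "many thin wrappers over a generic core" idea (tar_geo_vector() and its backend argument are hypothetical here, not geotargets API):

```r
# Each explicit wrapper pins down the package and object type, so the
# user's choice is visible in the pipeline code rather than hidden in
# options; a shared core would handle the GDAL-facing read/write work.
tar_terra_vect <- function(name, command, ...) {
  tar_geo_vector(name = name, command = command, backend = "terra", ...)
}

tar_sf <- function(name, command, ...) {
  tar_geo_vector(name = name, command = command, backend = "sf", ...)
}
```

The wrappers cost little to maintain, and a consistent naming scheme keeps the combinations discoverable.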

excellent description @brownag

note there is also {gdalraster} now, which already supports SOZip creation and file management, and has nascent 'gdalvector' support:

https://usdaforestservice.github.io/gdalraster/reference/addFilesInZip.html

https://usdaforestservice.github.io/gdalraster/articles/gdalvector-draft.html

gdalraster has become very richly featured very quickly, and could be the GDAL API that's otherwise entirely missing from R atm.

Sweet! I have seen {gdalraster} and watched some of the (rapid) progress on that with interest... but I don't think I was aware of the plans to provide bindings for the OGR vector API! I have used your vapour package for some of my generic/vector GDAL needs that go beyond terra/sf.

Something truly generic, mirroring the GDAL API ("close to the GDAL metal"), would allow for all sorts of capabilities and customization--perhaps {gdalraster} would be a good choice as an imported package doing the core backend work, plumbing to the various user-facing types/formats.

njtierney commented 4 months ago

Thanks for this @brownag !

Following on from @Aariq's #7 - I quite like the idea of a tar_{pkg}_<filetype> convention. I think that most users would know what package they are reading/creating things with, so it should hopefully facilitate discovery in that way?

I'm still learning about a lot of spatial things, so there is a bit of this I don't quite understand, but I think that this issue could be split out into multiple components, eventually, as there are a few threads in here.

Overall my preference for syntax would be something like:

tar_terra_raster(
  new_raster,
  raster_creation_function(args)
)

But overall this would work the same as:

tar_target(
  new_raster,
  raster_creation_function(args),
  format = "format_terra_raster"
)

Or something?

With tar_{pkg}_<filetype> we can expose a filetype argument to give users control, e.g.:

tar_terra_shapefile(
  my_shape,
  create_shapefile(args),
  filetype = "parquet"
)

But then I wonder if

tar_parquet_shapefile(
  my_shape,
  create_shapefile(args)
)

Would be better?

Naming things is hard. But I think that it is worthwhile thinking about the API design - once we have ideas on how we want the user to interact with the package my experience is that it is usually easier to write the code.

Aariq commented 4 months ago

Naming things is hard. But I think that it is worthwhile thinking about the API design - once we have ideas on how we want the user to interact with the package my experience is that it is usually easier to write the code.

I agree we should try to think through this and make some decisions before getting too far. Super helpful discussion here.

Aariq commented 3 months ago

I think everything in here is either in the package already, a PR, or a separate issue. Thanks for the contributions @brownag!