njtierney / geotargets

Targets extensions for geospatial data
https://njtierney.github.io/geotargets/
Other
49 stars 4 forks source link

Implement tar_sf_* family of targets #13

Open njtierney opened 4 months ago

njtierney commented 4 months ago

Related to #4

We need to list the datatypes from sf that we want to support

brownag commented 4 months ago

Some ideas of what could be implemented for object types in {sf}

For vector data, filetypes supported by GDAL for read/write could be used as the format for the file in the target store (including Parquet/Arrow), or you could use {arrow} per discussion in #2. Using arrow is probably a great default behavior for vector data due to efficiency/small size. Since sf objects in memory do not have issue with having an internal pointer, they can be serialized to normal 'rds' format files as well.

In the case of sfc/sfg only one column is being managed, but storage formats would likely be similar to what would be used for tar_sf() and the read/write method would just return the column of interest.

For other classes in {sf} like bbox/crs I don't think there is much benefit to having specific methods for these, they are simple enough to store with tar_target() in an ordinary "rds" format.

defuneste commented 3 months ago

Hello! Thanks for working on that this is great!

A quick idea/question:

sf is using two different engine (geos, s2) and use a setting function (sf_use_s2) to affect that. What could be a desired design for dealing with it in a "target" mindset? Is it something that should be set for a specific target or set at the workflow (or it depends)?

Aariq commented 3 months ago

sf is using two different engine (geos, s2) and use a setting function (sf_use_s2) to affect that. What could be a desired design for dealing with it in a "target" mindset? Is it something that should be set for a specific target or set at the workflow (or it depends)?

I'm currently including sf_use_s2() inside target functions when I need to, but I'm relatively new to using sf. Is it common to need to set sf_use_s2() for an entire workflow? If so, I can imagine this being an option in tar_sf*() that could be optionally set for the whole workflow.

brownag commented 3 months ago

In my work I have had cases where I set sf_use_s2(FALSE) for an entire workflow. This is often because I just don't "need" s2 and dont want to add extra boilerplate to fix geometries. I have seen many instances where geometries being operated on may be "invalid" from the perspective of s2, but would at least be functional (even if technically invalid) using geos backend.

Errors like this come up a lot with invalid geometry

Error in wk_handle.wk_wkb(wkb, s2_geography_writer(oriented = oriented,  : 
  Loop 1 is not valid: Edge 0 crosses edge 14

These things are usually fixable by sf::st_make_valid() before the operation that generates above error.

For geotargets, I think we can't assume the user is going to always be working with valid geometries, nor do we want to "fix" them for them. I think there should be a way to set/get the sf option "sf_use_s2" for a whole workflow as the results will subtly differ depending on the engine used

Aariq commented 3 months ago

Makes sense. So we'll add use_s2 to the list of options that can be set for an entire pipeline (by geotargets_options_set()) and in individual tar_sf_*() targets.

defuneste commented 3 months ago

I am running in the same "issue" describing by @brownag (s2 is a bit more "difficult"). If my understanding of how sf_use_s2() is correct even inside a function it will change how sf behave in an R session (and that can be tricky for non aware users).

 library("sf")
sf_use_s2()
#[1] TRUE
jim <- function() sf_use_s2(FALSE)
jim()
#Spherical geometry (s2) switched off
 sf_use_s2()
#[1] FALSE

I think adding in the options_set is a good idea.

The topic of invalid geometry is an important one that also probably deserve it specific issue (it will also depend on GEOS version).

Aariq commented 1 month ago

Since each target is run in a separate clean R session, it shouldn't be a problem to include sf_use_s2() at the top of a function used in a single target—that shouldn't effect the behavior of other targets. For the same reason though, it could be difficult to change the default for an entire workflow using sf_use_s2() and makes sense to consider an option that gets passed to each target.