mapme-initiative / mapme.biodiversity

Efficient analysis of spatial biodiversity datasets for global portfolios
https://mapme-initiative.github.io/mapme.biodiversity/
GNU General Public License v3.0
24 stars 7 forks source link

Use closures for resource and indicator functions #232

Closed goergen95 closed 4 months ago

goergen95 commented 6 months ago

I would like to propose to use closures for the definition of resource and indicator functions. Closures are function that return functions, thus they have two possible layers of arguments. We can use the first layer to expose the user-facing arguments to users of the package. The second layer than only requires package-internal arguments (that with this we could also check and emmit warnings if not matched.

I think the benefits for the users is a clearer UI: You have a single function for each resource/indicator that you can also look up in the help pages. Internally, there is the benefit that there are no ambiguties to which resource/indicator an arguemnt actually belongs. For developers, there is a little overhead introduced because the functions have to be wrapped in a second layer but it also gives much more freedom in the sense of user-facing arguement names etc.

If we started development on this I would suggest to do it on top of the into-the-clouds branch in order to cause less fricition once that one is merged.

Below is a simple mock-up of the interface would look like:

library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.4, PROJ 9.1.1; sf_use_s2() is TRUE
library(terra)
#> terra 1.7.68
require(mapme.biodiversity)
#> Loading required package: mapme.biodiversity

port <- st_read(system.file("extdata/sierra_de_neiba_478140_2.gpkg", package = "mapme.biodiversity"))
#> Reading layer `sierra_de_neiba_478140_2' from data source 
#>   `/home/darius/R/x86_64-pc-linux-gnu-library/4.3/mapme.biodiversity/extdata/sierra_de_neiba_478140_2.gpkg' 
#>   using driver `GPKG'
#> Simple feature collection with 1 feature and 4 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS:  WGS 84

# calc indicators expects several functions as the input to ...
# as well as some args for the processing
calc_indicators <- function(x, ..., rundir = tempdir(), verbose = FALSE){
  indicators <- list(...)
  lapply(indicators, function(f) .get_single_indicator(x, f, rundir, verbose))
}

# mock-up of pacakge internal function that prepares required resources
.get_single_indicator <- function(x, f, rundir, verbose){
  # reading dem as a .read_resource mock-up
  dem <- rast(system.file("res/nasa_srtm/NASADEM_HGT_n18w072.tif", package = "mapme.biodiversity"))
  # call the indicator function with the required resource
  do.call(f, list(x = x, dem = dem, rundir = rundir, verbose = verbose))
}

# wrapped indicator function, first level only includes user facing arguments
# second level includes the portfolio, required resources and standard arguments
calc_elevation <- function(stat, other_arg = NULL) {
  function(x, dem = NULL, rundir = tempdir(), verbose = FALSE){
    if (verbose && !is.null(other_arg)) print(other_arg)
    if (is.null(dem)) return(NULL)
    extract(dem, x)[, 2] |> stat(na.rm = TRUE)
  }
}

# the outer wrapper can now be called by users, benefits are auto-complete of args
# and it is clearer which argument belongs to a specific indicator
# Note, users are only directly facing the `calc_elevation()` function in the docu
# and in their scripts
calc_indicators(port, calc_elevation(stat = mean), rundir = "path/to/my-dir")
#> [[1]]
#> [1] 1704.433
calc_indicators(port, calc_elevation(stat = median, other_arg = "Hello World!"), rundir = "path/to/my-dir", verbose = TRUE)
#> [1] "Hello World!"
#> [[1]]
#> [1] 1702
calc_indicators(port, calc_elevation(stat = median), calc_elevation(stat = sd))
#> [[1]]
#> [1] 1702
#> 
#> [[2]]
#> [1] 219.2833

# direct usage e.g. in tests
elev <- calc_elevation(stat = mean)
elev(port, rast(system.file("res/nasa_srtm/NASADEM_HGT_n18w072.tif", package = "mapme.biodiversity")))
#> [1] 1704.433

Created on 2024-01-08 with reprex v2.0.2

karpfen commented 5 months ago

I like this a lot, especially calculating multiple indicators in one go will be much cleaner with this.

One thing to consider. If a user wants to calculate a lot of indicators and one of them fails (e.g. because wrong parameter, resource missing, etc.) I think the rest of the indicators should still be calculated if possible.

goergen95 commented 5 months ago

Maybe we can have something like a fail_early option. Say, you are working with a very large portfolio and request several indicators and some errors occur. You will just learn about that at the very end (at least if you use parallelization, because I think I have not figured out how to print errors directly in that mode) which might actually be after a considerable amount of time. Though I think that arguments should be checked in the outer function, so you get an error immediately when you supply e.g. a character instead of a numeric.

goergen95 commented 5 months ago

Oh, also I'd like to reconsider the return value of indicators if something does not match up. I think that NULL is a much more sensible value, but this is yet another issue. :smile:

goergen95 commented 4 months ago

This is currently WIP as announced in #240. Closing here, once the branch is ready for testing and issues arise it is better to put those in specific discussion threads.