mapme-initiative / mapme.biodiversity

Efficient analysis of spatial biodiversity datasets for global portfolios
https://mapme-initiative.github.io/mapme.biodiversity/dev
GNU General Public License v3.0

Improve runtime for small assets #267

Closed by goergen95 4 months ago

goergen95 commented 4 months ago

I get roughly a 50% decrease in average computation time with:

# install either the main branch or the feature branch before benchmarking:
# remotes::install_github("mapme-initiative/mapme.biodiversity", ref = "main")
# remotes::install_github("mapme-initiative/mapme.biodiversity", ref = "improve-runtime-for-small-assets")
library(mapme.biodiversity)
library(sf)
library(microbenchmark)

# create a temporary output directory and populate it with the sample
# resources shipped with the package
outdir <- file.path(tempdir(), "mapme.data")
dir.create(outdir, showWarnings = FALSE)
mapme.biodiversity:::.copy_resource_dir(outdir)

# read the sample polygon shipped with the package
x <- read_sf(
  system.file("extdata", "gfw_sample.gpkg", package = "mapme.biodiversity")
)

mapme_options(
  outdir = outdir,
  verbose = TRUE
)

# make the two GFW resources needed for the indicator available locally
x <- get_resources(
  x,
  get_gfw_treecover(version = "GFC-2020-v1.8"),
  get_gfw_lossyear(version = "GFC-2020-v1.8")
)

# call once to load namespace
calc_indicators(
  x,
  calc_treecover_area(
    years = 2000:2005, min_size = 5, min_cover = 30)
)

microbenchmark(
  branch = {
    calc_indicators(
      x,
      calc_treecover_area(
        years = 2000:2005, min_size = 5, min_cover = 30)
    )}
)

On main (d94fad068) I get:

Unit: milliseconds

   expr      min       lq     mean   median      uq      max neval
 branch 384.5842 388.6725 392.8769 390.0362 393.276 598.3322   100

versus the feature branch:

Unit: milliseconds

   expr      min       lq     mean   median       uq      max neval
 branch 159.6032 171.4363 182.1979 175.1873 180.3777 482.6047   100

@karpfen: Would you mind confirming?

goergen95 commented 4 months ago

Adapting the above script to process 100 assets via:

mapme_options(
  outdir = outdir,
  verbose = FALSE
)

x <- get_resources(
  x,
  get_gfw_treecover(version = "GFC-2020-v1.8"),
  get_gfw_lossyear(version = "GFC-2020-v1.8")
)

# replicate the single asset 100 times (list_rbind() requires purrr)
library(purrr)
x <- st_as_sf(list_rbind(lapply(1:100, function(i) x)))
x$assetid <- 1:nrow(x)
microbenchmark(
  branch = {
    calc_indicators(
      x,
      calc_treecover_area(
        years = 2000:2005, min_size = 5, min_cover = 30)
    )},
  times = 10
)

yields on main:

Unit: seconds
   expr      min       lq    mean   median       uq      max neval
 branch 17.42098 17.43061 17.6243 17.53582 17.68839 18.15672    10

versus the feature branch:

Unit: seconds
   expr     min       lq     mean   median       uq      max neval
 branch 8.81958 8.903752 9.088585 9.049905 9.153315 9.738127    10

goergen95 commented 4 months ago

Btw., x covers roughly 2,300 ha, which makes it larger than 80% of the assets in the portfolio we were discussing today.
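
For reference, a quick way to check an asset's area in hectares with sf and units (a minimal sketch, not part of the original benchmark):

library(sf)
library(units)

x <- read_sf(
  system.file("extdata", "gfw_sample.gpkg", package = "mapme.biodiversity")
)
# st_area() returns square meters; convert to hectares
set_units(st_area(x), "ha")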

karpfen commented 4 months ago

Wow, I didn't expect that to work so quickly, nice one :) I'll try this out later today

goergen95 commented 4 months ago

Would be nice if you could run the example code to see whether the improvement holds on another machine. Note that the speed-up in a real use case will depend strongly on both the structure of the assets (e.g. whether there are many multi-polygons) and on how you set up parallelization (number of cores at the asset level vs. the chunk level; parts of multi-polygons are also processed as chunks).
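
For completeness, a minimal sketch of a future-based parallel setup, following the pattern from the package documentation (the worker count of 4 is arbitrary, and the exact chunking behavior may differ between package versions):

library(future)
library(progressr)

# distribute asset/chunk processing across 4 parallel R sessions
plan(multisession, workers = 4)

with_progress({
  x <- calc_indicators(
    x,
    calc_treecover_area(years = 2000:2005, min_size = 5, min_cover = 30)
  )
})

# return to sequential processing
plan(sequential)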