mapme-initiative / mapme.biodiversity

Efficient analysis of spatial biodiversity datasets for global portfolios
https://mapme-initiative.github.io/mapme.biodiversity/
GNU General Public License v3.0

get_resources for global mangrove watch (gmw) is very long #170

Closed fBedecarrats closed 1 year ago

fBedecarrats commented 1 year ago

GMW is disseminated with only one file per year for the entire world. get_resources fetches the whole dataset (~180 MB per available year) and converts the source data to GeoPackage for the whole world, irrespective of the perimeter used to define the portfolio. The process takes ~3-4 minutes per year and the resulting GeoPackage is about 1 GB per year. With @lenaigmoign, we were wondering whether it would be possible to spatially filter for the area of interest defined in init_portfolio, so that the GeoPackage creation focuses on this AOI.
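For illustration, a minimal sketch of what we have in mind (not package code): sf's read_sf() can pass a wkt_filter down to GDAL so that only features intersecting the portfolio's bounding box are read before writing the GeoPackage. The aoi object and the file paths below are placeholders.

library(sf)

aoi <- read_sf("senegal_aoi.gpkg")              # hypothetical AOI used in init_portfolio
shp <- "gmw-extent-2016.shp"                    # hypothetical unzipped global GMW shapefile

# GDAL applies the spatial filter while reading, so only the AOI subset is loaded
bbox_wkt <- st_as_text(st_as_sfc(st_bbox(aoi)))
gmw_aoi <- read_sf(shp, wkt_filter = bbox_wkt)
write_sf(gmw_aoi, "gmw-extent-2016-aoi.gpkg")   # much smaller than the global GeoPackage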

goergen95 commented 1 year ago

From the current source, GMW data is distributed as one zip file per year containing a Shapefile. We convert this to GeoPackage internally. Since you will already have downloaded the global data set, we made the decision to also convert it to GeoPackage at the global level. This is the case for most vector resources currently integrated in the package (e.g. NASA FIRMS).

goergen95 commented 1 year ago

Open to suggestions to change that behavior, provided it ensures that the correct extent is always processed and does not require re-downloading the original data source.
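A rough sketch of the kind of check that would then be needed before reusing a spatially filtered file (object names and paths are illustrative, this is not how the package currently works):

library(sf)

cached <- read_sf("gmw-2016-aoi.gpkg")      # hypothetical cached, clipped file
portfolio <- read_sf("portfolio.gpkg")      # hypothetical portfolio geometries

cached_bbox <- st_as_sfc(st_bbox(cached))
needed_bbox <- st_as_sfc(st_bbox(portfolio))

# only reuse the cached file if it fully covers the requested extent,
# otherwise the original source would have to be processed again
covers <- st_covers(cached_bbox, needed_bbox, sparse = FALSE)[1, 1]
if (!covers) {
  message("Cached extent too small, reprocessing from the original download.")
}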

goergen95 commented 1 year ago

Some more context: the idea is that you download resources once and possibly share them between different portfolio analyses. If you change your extent only slightly and there is missing data, the package will download that missing data and process it to the appropriate format. Currently, mainly because sources are distributed in very different ways and in very different formats, the package takes a pragmatic approach and downloads the least amount of data necessary to match a given portfolio (e.g. in this case only the matching years are downloaded). The actual benefit comes if you store the resources in a common place and query different portfolios against them. In that case, for a later portfolio analysis none of the GMW data will have to be downloaded or processed.
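To make that concrete, a small sketch of the intended workflow (the AOI objects are placeholders): two portfolios point at the same outdir, so the second get_resources() call finds the GMW files that the first one already downloaded and processed.

library(mapme.biodiversity)

shared_dir <- "data/mapme_resources"   # common resource directory

senegal <- init_portfolio(aoi_senegal, years = 2010:2020,
                          outdir = shared_dir, add_resources = TRUE)
senegal <- get_resources(senegal, resources = "gmw")   # downloads and converts GMW

gambia <- init_portfolio(aoi_gambia, years = 2010:2020,
                         outdir = shared_dir, add_resources = TRUE)
gambia <- get_resources(gambia, resources = "gmw")     # reuses the existing files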

goergen95 commented 1 year ago

Using GDAL directly to translate to GeoPackage instead of running through sf, I see a reduction in processing time of about a factor of 2:

library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE

url <- "https://wcmc.io/GMW_2010"
rundir <- file.path(tempdir(), "mapme")
dir.create(rundir)
zip <- file.path(rundir, "gmw.zip")
download.file(url, zip)

utils::unzip(
  zipfile = zip,
  exdir = rundir
)

shp <- list.files(rundir, "\\.shp$", full.names = TRUE)

gpkg_sf <- file.path(rundir, "sf.gpkg")
gpkg_gdal <- file.path(rundir, "gdal.gpkg")

system.time({
  data <- read_sf(shp)
  write_sf(data, gpkg_sf)
})
#>    user  system elapsed 
#>  45.103   3.450  48.560

system.time({
  gdal_utils(util = "vectortranslate", source = shp, destination = gpkg_gdal)
})
#>    user  system elapsed 
#>  25.068   1.908  26.991

unlink(rundir, recursive = TRUE, force = TRUE)

Created on 2023-06-21 with reprex v2.0.2
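As a side note, the same call could in principle also clip to a bounding box during the conversion, since ogr2ogr options can be passed through; the coordinates below are illustrative (roughly Senegal) and this is not what the package currently does:

library(sf)

shp <- "gmw-extent-2010.shp"                 # hypothetical unzipped global shapefile
bb  <- c(-17.6, 12.2, -11.3, 16.8)           # xmin ymin xmax ymax

gdal_utils(
  util = "vectortranslate",
  source = shp,
  destination = "gmw-2010-senegal.gpkg",
  options = c("-spat", as.character(bb))     # ogr2ogr spatial filter
)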

goergen95 commented 1 year ago

The suggested change is merged into main. Would you mind updating me if you see any performance improvements?

fBedecarrats commented 1 year ago

Hi, thanks a lot @goergen95! There is some improvement in processing time with this modification, although it remains marginal (258 seconds for get_resources for one year, instead of 274 seconds with the CRAN version). Here is a reproducible example with the measured durations as comments:

# Test with the CRAN version

install.packages("mapme.biodiversity", quiet = TRUE)
library(mapme.biodiversity)
library(tidyverse)
library(wdpar)
library(sf)
library(tictoc)

if (!file.exists("data/WDPA/WDPA_Jun2023_SEN-shapefile.zip")) {
  WDPA_Senegal <- wdpa_fetch("Senegal", wait = TRUE, 
                             download_dir = "data/WDPA") 
} else {
  # Read from the locally saved copy
  WDPA_Senegal <- wdpa_read("data/WDPA/WDPA_Jun2023_SEN-shapefile.zip") 
}

WDPA_mapme <- WDPA_Senegal %>%
  filter(st_geometry_type(.) != "MULTIPOINT") %>%
  st_cast("POLYGON")

WDPA_mapme <- init_portfolio(x = WDPA_mapme, 
                             years = 2016:2016,
                             outdir = "data/mapme_Senegal_vCRAN",
                             add_resources = TRUE,
                             verbose = TRUE)
tic()
WDPA_mapme <- get_resources(x = WDPA_mapme, resources = "gmw")
toc()
# 273.977 sec elapsed

detach("package:mapme.biodiversity", unload = TRUE)
remove.packages("mapme.biodiversity")

# now with dev version
new <- "https://github.com/mapme-initiative/mapme.biodiversity"
remotes::install_github(new, upgrade = "never")
library(mapme.biodiversity)

WDPA_mapme <- WDPA_Senegal %>%
  filter(st_geometry_type(.) != "MULTIPOINT") %>%
  st_cast("POLYGON")

WDPA_mapme <- init_portfolio(x = WDPA_mapme, 
                             years = 2016:2016,
                             outdir = "data/mapme_Senegal_vGithub",
                             add_resources = TRUE,
                             verbose = TRUE)
tic()
WDPA_mapme <- get_resources(x = WDPA_mapme, resources = "gmw")
toc()
# 257.642 sec elapsed

fBedecarrats commented 1 year ago

Some more context: the idea is that you download resources once and possibly share them between different portfolio analyses. If you change your extent only slightly and there is missing data, the package will download that missing data and process it to the appropriate format. Currently, mainly because sources are distributed in very different ways and in very different formats, the package takes a pragmatic approach and downloads the least amount of data necessary to match a given portfolio (e.g. in this case only the matching years are downloaded). The actual benefit comes if you store the resources in a common place and query different portfolios against them. In that case, for a later portfolio analysis none of the GMW data will have to be downloaded or processed.

I understand the approach and I agree. The difficulty is that we took the mangroves as the example for our meeting in Senegal, because most of the researchers there work on coastal and marine protected areas. As we have ~10 years of GMW data, the get_resources step takes too long (~1 h to complete) to be compatible with the training schedule. But I understand and agree that the approach chosen seems the most adequate for this type of case, where analyses may be repeated. For our training purposes, we will find a workaround by pre-loading the data on the machines used for the workshop. Many thanks!
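For the record, the workaround we plan for the workshop is simply to ship a pre-populated resource directory with the machines and point init_portfolio() at it, so that get_resources() finds the existing files instead of downloading ~10 years of global GMW data (the directory name below is just an example):

WDPA_mapme <- init_portfolio(x = WDPA_mapme,
                             years = 2010:2020,
                             outdir = "data/preloaded_resources",  # copied onto each machine beforehand
                             add_resources = TRUE,
                             verbose = TRUE)
WDPA_mapme <- get_resources(x = WDPA_mapme, resources = "gmw")  # should skip the downloads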