prioritizr / wdpar

Interface to the World Database on Protected Areas
https://prioritizr.github.io/wdpar
GNU General Public License v3.0

JOSS Review: Add small POC about Performance claims #61

Closed. Jo-Schie closed this issue 1 year ago.

Jo-Schie commented 1 year ago

The JOSS review criteria ask reviewers to assess whether the package makes any performance claims that can be verified. In your quick-start vignette, you state that:

"The wdpar R package can be used to clean large datasets assuming that sufficient computational resources and time are available. Indeed, it can clean data spanning large countries, multiple countries, and even the full global datatset (sic). When processing the full global dataset, it is recommended to use a computer system with at least 32 GB RAM available and to allow for at least one full day for the data cleaning procedures to complete"

Is it possible for you to provide a very short example that supports that claim? I thought maybe you could randomly sample e.g. 500 areas from the global dataset, process them while recording the start and stop times, and report back. This would help to extrapolate from and substantiate the claim. I can also try to create a small example, but I have never worked with the global data, so I guess it might be easier for you to do so.
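Something like the following minimal sketch is what I have in mind (purely illustrative: the sample size of 500 is arbitrary, and it assumes the global dataset has already been downloaded with wdpa_fetch()):

# sketch: time the cleaning step for a random sample of 500 protected areas
library(wdpar)

## download (or reuse a cached copy of) the global dataset
raw_data <- wdpa_fetch("global", wait = TRUE)

## randomly sample 500 protected areas
subset_data <- raw_data[sample(nrow(raw_data), 500), ]

## record how long the cleaning step takes
timing <- system.time(
  clean_data <- wdpa_clean(subset_data)
)
print(timing)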

I would not care too much about the efficiency of geospatial software designed for local-scale analysis, but the beauty of the wdpar package is, at least in theory, that one could compute summary statistics on global progress towards achieving the Aichi targets (or the post-Aichi targets, e.g. if the 30 by 30 goal is approved by the global conservation community; by the way, that is something you could also mention in your use case description). So in that specific case, I would really like to see whether the package could be used for that.

jeffreyhanson commented 1 year ago

Yeah, I can set up a run, and then send you the code, log file, and run times.

jeffreyhanson commented 1 year ago

I've just done a run of the global database using the example R script distributed with the package (see https://github.com/prioritizr/wdpar/blob/master/inst/scripts/global-example-script.R). I've copied in the log file below and included the session information too. Since this was run on a server with 60 GB RAM, it's relatively fast because all the processing can be done in RAM without resorting to swap space. Let me know if you need any further details.


Log file

R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> # System command to execute:
> # R CMD BATCH --no-restore --no-save global-example-script.R
> 
> # Initialization
> ## define countries for processing data
> country_names <- "global"
> 
> ## define file path to save data
> path <- paste0(
+   "~/wdpa-data/global-", format(Sys.time(), "%Y-%m-%d"), ".gpkg"
+ )
> 
> ## load packages
> library(sf)
Linking to GEOS 3.10.2, GDAL 3.4.3, PROJ 8.2.0; sf_use_s2() is TRUE
> library(wdpar)
> 
> # Preliminary processing
> ## prepare folder if needed
> export_dir <- suppressWarnings(normalizePath(dirname(path)))
> if (!file.exists(export_dir)) {
+   dir.create(export_dir, showWarnings = FALSE, recursive = TRUE)
+ }
> 
> ## prepare user data directory
> data_dir <- rappdirs::user_data_dir("wdpar")
> if (!file.exists(data_dir)) {
+   dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)
+ }
> 
> # Main processing
> ## download data
> raw_data <- wdpa_fetch(
+   country_names, wait = TRUE, download_dir = data_dir, verbose = TRUE
+ )
 [100%] Downloaded 194 bytes...
 [100%] Downloaded 1537392988 bytes...

Warning message:
In CPL_read_ogr(dsn, layer, query, as.character(options), quiet,  :
  GDAL Message 1: organizePolygons() received a polygon with more than 100 parts. The processing may be really slow.  You can skip the processing by setting METHOD=SKIP, or only make it analyze counter-clock wise parts by setting METHOD=ONLY_CCW if you can assume that the outline of holes is counter-clock wise defined
> 
> ## clean data
> result_data <- wdpa_clean(raw_data, erase_overlaps = FALSE, verbose = TRUE)
ℹ initializing
✔ initializing [36ms]

ℹ retaining only areas with specified statuses
✔ retaining only areas with specified statuses [17.9s]

ℹ removing UNESCO Biosphere Reserves
✔ removing UNESCO Biosphere Reserves [18.8s]

ℹ removing points with no reported area
✔ removing points with no reported area [18.1s]

ℹ wrapping dateline
✔ wrapping dateline [4m 48.2s]

ℹ repairing geometry
✔ repairing geometry [31m 16.9s]

ℹ reprojecting data
✔ reprojecting data [29.2s]

ℹ repairing geometry
✔ repairing geometry [10m 21s]

ℹ further geometry fixes (i.e. buffering by zero)
✔ further geometry fixes (i.e. buffering by zero) [6m 19.2s]

ℹ buffering points to reported area
✔ buffering points to reported area [48.8s]

ℹ repairing geometry
✔ repairing geometry [8m 8.8s]

ℹ snapping geometry to tolerance
✔ snapping geometry to tolerance [15s]

ℹ repairing geometry
✔ repairing geometry [11m 57s]

ℹ formatting attribute data
✔ formatting attribute data [50ms]

ℹ removing slivers
✔ removing slivers [13.1s]

ℹ calculating spatial statistics
✔ calculating spatial statistics [6.6s]

> 
> # Exports
> ## save result
> sf::write_sf(result_data, path, overwrite = TRUE)
> 
> proc.time()
    user   system  elapsed 
4583.412  201.009 4876.970 

Session information

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] wdpar_1.3.3 sf_1.0-8   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9         magrittr_2.0.3     units_0.8-0        tidyselect_1.1.1  
 [5] R6_2.5.1           rlang_0.4.12       fansi_0.5.0        dplyr_1.0.7       
 [9] tools_4.1.2        grid_4.1.2         KernSmooth_2.23-20 utf8_1.2.2        
[13] e1071_1.7-11       DBI_1.1.3          ellipsis_0.3.2     class_7.3-19      
[17] assertthat_0.2.1   tibble_3.1.6       lifecycle_1.0.1    crayon_1.4.2      
[21] purrr_0.3.4        vctrs_0.3.8        glue_1.5.1         proxy_0.4-27      
[25] compiler_4.1.2     pillar_1.6.4       generics_0.1.1     classInt_0.4-7    
[29] pkgconfig_2.0.3

Jo-Schie commented 1 year ago

Sorry, maybe I overlooked it... but how many polygons did you process? This was not the global WDPA, right?

Jo-Schie commented 1 year ago

Ah okay, got it. It is the global data, but without unionizing. Hmmm. I thought your section on big data and processing overnight was also referring to unionizing, i.e. erasing overlaps. Did you ever do that for the global data?

If so, could there be a way to more or less benchmark it? I know it gets more complex now...

jeffreyhanson commented 1 year ago

Yeah, that's right. The resulting dataset contains 272,466 protected areas. I have tried running the global data with erase_overlaps = TRUE and it doesn't work: the geometry processing dies due to (extremely) invalid geometries and I couldn't find a workaround. To address this, the package documentation recommends using erase_overlaps = FALSE for large datasets (e.g., https://prioritizr.github.io/wdpar/articles/wdpar.html#recommended-practices-for-large-datasets, https://prioritizr.github.io/wdpar/reference/wdpa_clean.html#recommended-practices-for-large-datasets-1) and provides advice for post-processing (e.g., using wdpa_dissolve() to take care of overlaps), as sketched below.
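For illustration, the recommended workflow looks roughly like this; a minimal sketch using a small country as a stand-in, since the global dataset takes much longer:

# sketch: clean a dataset without erasing overlaps, then dissolve afterwards
library(wdpar)

## download and clean data, skipping the expensive overlap erasure step
raw_data <- wdpa_fetch("Liechtenstein", wait = TRUE)
clean_data <- wdpa_clean(raw_data, erase_overlaps = FALSE)

## dissolve the cleaned data to handle overlapping areas in a single step
dissolved_data <- wdpa_dissolve(clean_data)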

Jo-Schie commented 1 year ago

Seems fine to me. Thanks for the proof.

jeffreyhanson commented 1 year ago

Brilliant - thanks!