mapme-initiative / mapme.biodiversity

Efficient analysis of spatial biodiversity datasets for global portfolios
https://mapme-initiative.github.io/mapme.biodiversity/dev
GNU General Public License v3.0

Performance issues on Kubernetes #90

Closed · fBedecarrats closed this issue 2 years ago

fBedecarrats commented 2 years ago

Hi there, and thanks again for this wonderful package! I'm preparing a simplified version of your mapme.protectedareas workflow for a forthcoming training workshop for young researchers in Madagascar. The work in progress (still at an early stage) is available here. To store and process the data, I am using SSP-Cloud, a Kubernetes instance provided by the French national institute for statistics (INSEE) that is open to data scientists working in the public administration. The Kubernetes server is accessible through a UI named Onyxia, which eases the management of Kubernetes pods and the S3 buckets of an associated data lake, but it is only a wrapper around prepackaged containers for common uses (RStudio, Jupyter, Spark, Trino, Superset...). I use an RStudio pod and can allocate as much CPU, memory, or storage as I need.

The problem is that the processing of indicators freezes after a few minutes. Since I process the data for one country only (Madagascar) and can allocate large CPU resources to my pod (equivalent to 30 CPUs, if I understand the parameters correctly), I launch the calc_indicators() function with 12 or 16 CPUs, and the processing time is often estimated at <10 min for each indicator. However, the process generally stalls at 8 to 12%. I usually end up slicing the honeycomb into 13 pieces (using split(aoi2, (seq(nrow(aoi2))-1) %/% 10000)) and running the calculation with 4 or 1 CPUs: it takes a long time but finally works.

I would like to figure out where the problem comes from: does {mapme.biodiversity} use a parallelization method that is incompatible with Kubernetes? Did I misunderstand some Kubernetes parameters when setting up my pods (I tried many times with different parameters, though)? Something else? I am looking for suggestions from the {mapme.biodiversity} developers/maintainers on how to run systematic tests to pinpoint the origin of the problem and report it in a reliable manner, so you can assess whether it calls for an improvement in the package.
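For reference, a rough sketch of the slicing workaround mentioned above, where aoi2 stands for my initialized portfolio (the indicator call is just an illustration):

# split the portfolio into chunks of at most 10000 rows and
# process each chunk separately (indicator arguments are illustrative)
chunks <- split(aoi2, (seq(nrow(aoi2)) - 1) %/% 10000)
results <- lapply(chunks, function(chunk) {
  calc_indicators(chunk, "traveltime",
                  stats_accessibility = "mean", engine = "extract")
})
aoi2_processed <- do.call(rbind, results)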

goergen95 commented 2 years ago

Hi! Thanks for reporting. Seems like a very interesting use case. First of all, I do not know much about Kubernetes, so I would like to rule out some other issues first. You mentioned that if you split up the portfolio and process each split using only 4 CPUs, the calculations run successfully. Could you make another run using the complete portfolio and more cores (e.g. 16)? Then have a look at the tmpdir that you specified. I suspect that the initial polygons may fall onto the ocean, so they are processed quite fast and you get a 10-minute estimate. In the tmpdir location, directories will be created, named after the row numbers of the polygons. Does it really stall there? Or does it simply take more time to compute the indicators for the later polygons? If it really stalls, maybe you can inspect (or share) one of these polygons so that we can investigate what might go wrong there?
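A minimal sketch for checking from R which polygons have been processed so far, assuming the portfolio was initialized with tmpdir = "tmp" and the indicator is "traveltime" (both names are illustrative):

# list the per-polygon directories created under the tmpdir
poly_dirs <- list.dirs(file.path("tmp", "traveltime"), recursive = FALSE)
# keep only the numeric directory names (row numbers of the polygons)
ids <- suppressWarnings(as.integer(basename(poly_dirs)))
sort(ids[!is.na(ids)])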

fBedecarrats commented 2 years ago

Hi Darius, I'll try to build a simplified example to make the analysis easier. I'm launching a Kubernetes pod with RStudio server. I'm allocating CPU capacity with a guaranteed minimum of 22600m CPU (~22 CPUs) and a maximum of 30000m (~30 CPUs). I also allocate a minimum of 20 GB of memory and a maximum of 60 GB. [Screenshot: pod resource configuration, 2022-08-21] Now I'm launching the following code:

# Install libs
remotes::install_github("mapme-initiative/mapme.biodiversity", force = TRUE,
                        upgrade = "always")
required_libs <- c("dplyr", "tidyr", "sf", "wdpar", "tmap", "geodata",
                   "tidygeocoder", "maptiles", "purrr", "mapme.biodiversity")
missing_libs <- !(required_libs %in% installed.packages())
if (any(missing_libs)) install.packages(required_libs[missing_libs])
lapply(required_libs, require, character.only = TRUE)

# Get protected areas
PA_mada <- wdpa_fetch("Madagascar", wait = TRUE,
                      download_dir = "data_s3/WDPA") %>%
  st_transform(crs = "EPSG:29739") %>%
  filter(STATUS != "Proposed") %>%
  filter(DESIG != "Locally Managed Marine Area", DESIG != "Marine Park")

# honeycomb with hexagons of 5 km2 each
PA_mada_box <- st_as_sf(st_as_sfc(st_bbox(PA_mada)))
area_cell <- 5 * 1e+6  # 5 km2 expressed in m2
cell_size <- 2 * sqrt(area_cell / (3 * sqrt(3) / 2)) * sqrt(3) / 2
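# Why this formula: a regular hexagon with side s has area (3 * sqrt(3) / 2) * s^2,
# so s = sqrt(area_cell / (3 * sqrt(3) / 2)). For hexagonal cells st_make_grid()
# takes as cellsize the distance between opposite edges, i.e. s * sqrt(3),
# which is what cell_size computes above (note 2 * sqrt(3) / 2 = sqrt(3)).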
grid_mada <- st_make_grid(x = PA_mada_box,
                          cellsize = cell_size,
                          square = FALSE)

terrestrial_mada <- gadm(country = "Madagascar", resolution = 1, level = 0,
                         path = "data_s3/GADM") %>%
  st_as_sf() %>% 
  st_transform(crs = "EPSG:29739")

terrestrial_cells <- st_intersects(terrestrial_mada, grid_mada) %>%
  unlist()
grid_mada <- grid_mada[sort(terrestrial_cells)] %>%
  st_sf()

# set portfolio
dir.create("tmp")
grid_mada <- init_portfolio(x = grid_mada, 
                            years = 2000:2020,
                            outdir = "data_s3/mapme",
                            tmpdir = "tmp",
                            cores = 16,
                            add_resources = FALSE,
                            verbose = TRUE)

# Fetch accessibility (travel time) data from Nelson et al.
grid_mada <- get_resources(x = grid_mada, resource = "nelson_et_al",
                           range_traveltime = "5k_110mio")
grid_mada <- calc_indicators(x = grid_mada,
                             "traveltime", stats_accessibility = "mean",
                             engine = "extract")

It gets stuck at:

  |++                                                | 4 % ~05m 33s      

Subfolders have been created in the tmp directory (a subfolder "traveltime", with 3 subfolders inside: "4875", "4878" and "terra"), but there is nothing inside any of them. Now I suspect it might be an issue with write permissions on the temp folder for the underlying processes. So I just tried launching it again without specifying any tmpdir, to fall back on the default one... and it works! Let me check with the other calc_indicators() calls...
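In case it helps others hitting the same symptom, a quick sketch to test whether the R process can actually write to the intended tmpdir (the "tmp" path is taken from the code above):

# 0 means the path is writable for this process, -1 means it is not
file.access("tmp", mode = 2)
# or simply attempt to create and remove a file there
test_file <- file.path("tmp", "write_test.txt")
tryCatch({
  writeLines("test", test_file)
  file.remove(test_file)
}, error = function(e) FALSE)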

Jo-Schie commented 2 years ago

Just as a very general comment: @Ohm-Np found out that there are no huge efficiency gains beyond 6-8 cores when parallelizing. So if you are not in too much of a hurry and want to save a few bucks, you can use a more modest machine.
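For intuition, this is what Amdahl's law predicts: if a fraction p of the work can be parallelized, the speedup on n cores is 1 / ((1 - p) + p / n). A quick illustration in R (the value p = 0.9 is an arbitrary assumption, not a measurement):

# theoretical speedup under Amdahl's law for an assumed parallel fraction p
amdahl <- function(n, p = 0.9) 1 / ((1 - p) + p / n)
round(sapply(c(1, 4, 8, 16, 24), amdahl), 1)
# 1.0 3.1 4.7 6.4 7.3 -> most of the gain is already realized around 8 cores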

fBedecarrats commented 2 years ago

OK, let me check on this install. I've launched the processing for the whole of Madagascar with 24 cores and the estimated processing time is ~1h15. Yesterday I launched it with 4 cores (I think; I made so many attempts, maybe it was just 1 core) and it took >6h. I'm launching another pod right now with the same specs to see if it behaves differently. The resources I'm using are available for free and I don't think that there is a saturation risk on a Sunday.

fBedecarrats commented 2 years ago

I can confirm that the number of cores makes a difference on the Kubernetes platform I am using. Processing treecover_area_and_emissions on 120612 cells for Madagascar took 1h17 with cores = 24. I launched the very same pod with the same data and script, changing the number of cores to cores = 8, and the processing time is now estimated at ~3h15. It is still running, but I'll report the precise execution time when it is finished.

Jo-Schie commented 2 years ago


Wow. That's pretty interesting. Seems like the Kubernetes approach performs differently from a classical parallelization approach on a single VM. Wondering what the differences might be...

fBedecarrats commented 2 years ago


3 hours and 34 minutes with 8 cores, vs. 1 hour and 17 minutes with 24 cores: that is 214 min vs. 77 min, so roughly 2.8x faster with 3x more cores, i.e. close to linear scaling. I am closing the issue, as the problem was apparently with the temp folder. Do not hesitate to reopen it or get back to me if you want me to run further tests on this installation. Many thanks for the quick feedback (you shouldn't work on Sundays though ;-).

fBedecarrats commented 1 year ago

Hi there. While running tests for #105, I benchmarked the performance with several core counts. The code used as a benchmark:

library(tictoc)
library(dplyr)
library(sf)
library(mapme.biodiversity)

my_shp <- "https://github.com/mapme-initiative/mapme.biodiversity/files/9746104/AP_Vahatra_shp.zip"
download.file(my_shp, destfile = "Vahatra98AP.zip")
unzip("Vahatra98AP.zip", exdir = "Vahatra")
PA_mada <- st_read("Vahatra/AP_Vahatra.shp", quiet = TRUE) %>%
  # The projection is missing (no .prj file), so we set it manually
  st_set_crs("EPSG:4326")

# Discard points and cast multipolygons as polygons
PA_poly <- PA_mada %>%
  st_repair_geometry() %>%
  filter(st_geometry_type(.) == "MULTIPOLYGON") %>%
  st_cast("POLYGON")

# Subset to small PAs (<= 1000 ha)
PA_poly_small <- PA_poly %>%  
  filter(hectars <= 1000)

# Test with 4 cores
PA_poly_small <- init_portfolio(x = PA_poly_small, 
                                  years = 2000:2020,
                                  outdir = "data_s3/mapme",
                                  cores = 4,
                                  add_resources = TRUE,
                                  verbose = TRUE)
# Get GFW data
PA_poly_small <- get_resources(x = PA_poly_small, 
                               resources = c("gfw_treecover", "gfw_lossyear", 
                                             "gfw_emissions"))
# With 4 cores (the timer does not include init_portfolio)
start_4 <- tic()
PA_poly_small <- calc_indicators(x = PA_poly_small,
                                 indicators = "treecover_area_and_emissions", 
                                 min_cover = 10, min_size = 1)
duration_4 <- toc()
# With 8 cores 
start_8 <- tic()
PA_poly_small_8 <- init_portfolio(x = PA_poly_small, 
                                  years = 2000:2020,
                                  outdir = "data_s3/mapme",
                                  cores = 8,
                                  add_resources = TRUE,
                                  verbose = TRUE)
PA_poly_small_8 <- calc_indicators(x = PA_poly_small_8,
                                   indicators = "treecover_area_and_emissions", 
                                   min_cover = 10, min_size = 1)
duration_8 <- toc()
# With 16 cores
start_16 <- tic()
PA_poly_small_16 <- init_portfolio(x = PA_poly_small, 
                                   years = 2000:2020,
                                   outdir = "data_s3/mapme",
                                   cores = 16,
                                   add_resources = TRUE,
                                   verbose = TRUE)
PA_poly_small_16 <- calc_indicators(x = PA_poly_small_16,
                                    indicators = "treecover_area_and_emissions", 
                                    min_cover = 10, min_size = 1)
duration_16 <- toc()
# With 24 cores
start_24 <- tic()
PA_poly_small_24 <- init_portfolio(x = PA_poly_small, 
                                   years = 2000:2020,
                                   outdir = "data_s3/mapme",
                                   cores = 24,
                                   add_resources = TRUE,
                                   verbose = TRUE)
PA_poly_small_24 <- calc_indicators(x = PA_poly_small_24,
                                    indicators = "treecover_area_and_emissions", 
                                    min_cover = 10, min_size = 1)
duration_24 <- toc()

duration_4$callback_msg # init_portfolio not included in this timer
duration_8$callback_msg
duration_16$callback_msg
duration_24$callback_msg

And the result: [screenshot of the benchmark timings, 2022-10-14]
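A small sketch to tabulate the same timings programmatically instead of reading them off the console, assuming the duration_* objects from the script above (tictoc's toc() returns the start and stop times in seconds):

# summarise the four runs and compute the speedup relative to 4 cores
timings <- data.frame(
  cores   = c(4, 8, 16, 24),
  seconds = c(duration_4$toc  - duration_4$tic,
              duration_8$toc  - duration_8$tic,
              duration_16$toc - duration_16$tic,
              duration_24$toc - duration_24$tic)
)
timings$speedup <- round(timings$seconds[1] / timings$seconds, 2)
timings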