mapme-initiative / mapme.biodiversity

Efficient analysis of spatial biodiversity datasets for global portfolios
https://mapme-initiative.github.io/mapme.biodiversity/dev
GNU General Public License v3.0
33 stars 7 forks source link

GFW Treecover returns NA instead of correct value on Windows machine #108

Closed karpfen closed 1 year ago

karpfen commented 1 year ago

I was getting some unexpected results when calculating GFW tree coverage and noticed that it only happens on my windows machine Here is a small example:

library(sf)
library(mapme.biodiversity)

faulty_poly <- st_as_sf(st_as_sfc(
"POLYGON ((-59.6354 5.0372, -59.3598 5.0372, -59.3598 5.3175, -59.6354 5.3175, -59.6354 5.0372))"
, crs = 4326
))

tree_cover <- faulty_poly %>%
  init_portfolio(2000:2022,
                 cores = 6,
                 verbose = TRUE) %>%
  get_resources(
    resources = c("gfw_treecover", "gfw_lossyear"),
    vers_treecover = "GFC-2020-v1.8",
    vers_lossyear = "GFC-2020-v1.8"
  )
  tc_area <- tree_cover %>%
  calc_indicators("treecover_area",
                  min_size = 1, min_cover = 1)

print(tc_area$treecover_area [[1]]$value)

When running this code, it prints a warning "Error : [classify] cannot overwrite existing file", which is produced by terra::mask in https://github.com/mapme-initiative/mapme.biodiversity/blob/main/R/calc_treecover_area.R Running terra::mask by itself works as expected, so I don't think the bug lies there.

The same code works as expected when I run it on a Linux machine. Even though I'm not 100% sure yet what causes this bug, maybe some other users can try and reproduce it on their machines to narrow it down.

Versions: R: 4.2.1, terra: 1.6-17, mapme.biodiversity: 0.2.1, sf: 1.0-8

goergen95 commented 1 year ago

Thanks for the repex! Potentially related to #105, I got a simillar error message (Error : [mask] source and target filename cannot be the same) on a Linux machine with the same version of packages. The error seems to come from using existing filenames in terra::mask and/or terra::classify for rasters with N > 1024*1024 cells. I can't really get my head around why it is sometimes working and sometimes it is not... But changing the source code should solve the problem. Would you like to prepare a PR?

goergen95 commented 1 year ago

Also, currently the 1024*1024 cells threshold is hard-coded in the GFW routines and not tested against with the unit tests. We need a smarter way to maybe interactively set this value so that we can set it down during tests. But this should not be part of init_portfolio() in order to not over-complicate things for users.

Jo-Schie commented 1 year ago

I can confirm the bug using both terra 1-6-7and terra 1-6-17. Can you elaborate a bit more @goergen95 what would need to be changed to prepare a PR? I guess the affected scripts are this and this and potentially this.

From my limited understanding about the inner workings of the package, I would guess that the problem is e.g. in this line here.

I would see potentially two options:

  1. remove the temporary (hardcoded) export of the masked raster. This would, from my understanding, force R to save the file to the tmp folder with a dynamically created file-name and thus avoid the error.
  2. create a dynamic filename at the time of export maybe using something like Sys.Time() or a random number generator or something like that.

Could you confirm @goergen95 that this could be a feasible approach? Or am I getting something totally wrong here?

Jo-Schie commented 1 year ago

by the way. Reading through the script I also saw that there are some commands called ifel ... I guess those should be ifelse ? At least in my R ifel is not a recognized function. So this might also cause disturbances...

see e.g. here and here and here

goergen95 commented 1 year ago

ifel() is an exported function of the {ŧerra} package, so no need to change that.

I suspect the general issue to occur when we use mask() or classify(), etc., on raster files that already exist on disk. I agree with your first option as a viable solution. The original intention here was to implement a memory safe but efficient processing of many assets in parallel. Depending on the available RAM on a given machine and the number of parallel threads used, our idea was to write rasters of a certain size to disk so that smaller rasters than this threshold get processed more efficiently in memory. This is currently set to a fairly random threshold of 1024x1024 pixels because we wanted to wait for the results of @Ohm-Np investigation on parallel processing to set sensible thresholds within the package. By removing the to_disk logic, we would simply let {terra} handle which files to write to temporary files and which are not. That is definitely a better option than the current solution where in some settings it causes bugs because the same filenames are used. However, it might not be the most efficient solution for the package's use case because {terra} uses its own heuristics to determine whether or not to write a certain file to disk. So in my view there are two options:

  1. Similar to your suggestion, just remove the to_disk logic and let {terra} handle the internals. Safe way to go to get rid of the bug, but maybe not the most efficient solution.
  2. Rework the logic of when files are written to disk making sure that no filename is used twice and incorporate an efficient allocation of resources based on available memory/threads per machine. This approach has the potential to result in more efficient calculations but requires some development efforts.
Jo-Schie commented 1 year ago

Thanks for your comment @goergen95. My interpretation of how terra internals work would have been, that anyway raster are stored to disk but normally in the tempfolder. AFAIK this is the main difference to raster where you would always run out of memory when operating with large rasters. So I am not really sure if this manual storage to disk is required anyway...(?). In that sense removing the line would probably not make any difference regarding memory and CPU usage.

As to your second suggestion. Also there I am unsure if it is really necessary to elaborate sophisticated memory and cpu handling routines. If we were to just change the file name to a clearly distinguishable file name then I guess there would be no differences to the current approach despite the filename...(?). In that sense all could stay the same.

Anyways we should probably just try it out, so maybe I'll just open up a branch for that.

Jo-Schie commented 1 year ago

okay. I created a quickfix please see PR #109

As mentioned above I am really not sure if we need the todisk routine for memory handling. Maybe some tests with larger data would shed a new light on this @karpfen. If so, then we would need to go for the second option outlined above. I think it would be enough then to work on this if we face the problem. The quickfix should enable users to keep on working with the package and we can just wait and see what happens.

Jo-Schie commented 1 year ago

@karpfen : Do you want to try the quickfix on your windows machine with the following repex. If it works you may close the issue and @goergen95 might trigger a bugfix update on CRAN. The following works on linux:

remotes::install_github("https://github.com/mapme-initiative/mapme.biodiversity")

library(sf)
library(mapme.biodiversity)

# dir.create("~/mapme.biodiversity.issues/108/")
setwd("~/mapme.biodiversity.issues/108/")

# unlink("~/mapme.biodiversity.issues/108/gfw_lossyear/", recursive=TRUE)
# unlink("~/mapme.biodiversity.issues/108/gfw_treecover/", recursive=TRUE)

faulty_poly <- st_as_sf(st_as_sfc(
  "POLYGON ((-59.6354 5.0372, -59.3598 5.0372, -59.3598 5.3175, -59.6354 5.3175, -59.6354 5.0372))"
  , crs = 4326
))

tree_cover <- faulty_poly %>%
  init_portfolio(2000:2022,
                 cores = 6,
                 verbose = TRUE) %>%
  get_resources(
    resources = c("gfw_treecover", "gfw_lossyear","gfw_emissions"),
    vers_treecover = "GFC-2020-v1.8",
    vers_lossyear = "GFC-2020-v1.8"
  )

tc_area <- tree_cover %>%
  calc_indicators("treecover_area",
                  min_size = 1,
                  min_cover = 1)

print(tc_area$treecover_area)

tc_area_emission <- tree_cover %>%
  calc_indicators("treecover_area_and_emissions",
                  min_size = 1,
                  min_cover = 1)

print(tc_area_emission$treecover_area_and_emissions)

tc_emission <- tree_cover %>%
  calc_indicators("treecoverloss_emissions",
                  min_size = 1,
                  min_cover = 1)

print(tc_emission$treecoverloss_emissions)
karpfen commented 1 year ago

Thanks for the explanations and quick fix, @Jo-Schie and @goergen95! I can confirm that with the fix in place, it works as expected for all three indicators on my Windows setup.