mapme-initiative / mapme.biodiversity

Efficient analysis of spatial biodiversity datasets for global portfolios
https://mapme-initiative.github.io/mapme.biodiversity/dev
GNU General Public License v3.0
33 stars 7 forks source link

Fix or improve broken download functionality for global datasets #84

Closed Jo-Schie closed 2 years ago

Jo-Schie commented 2 years ago

I found a rather serious bug today when I tried to download global forest watch data from a global portfolio. The cause of the download seems to be that the .get_gfw_lossyear creates a wrong bounding box when the areas cross the dateline.

I try to illustrate this with a couple of screenshots below.

The first screenshot shows tiles downloaded for a portfolio that consist of areas in Peru and Laos in Southeast Asia. The function downloads a lot of tiles but not the ones that are necessary.

Latin_America-Asia

Next Screeenshot illustrates what gets downloaded for a portfolio that consist of areas in Peru and Namibia.

Latin_America-Africa

In this case all necessary data is downloaded but also a bunch of useless data which consumes bandwidth, storage and a lot of time.

I am not sure why you are intersecting the bounding box of the AOIs instead of the AOIs themselves.

But if you were to remove the bounding box code (I did that in a new branch) then you get totally reasonable results see this image here:

Latin_America-Africa _ afterfix

Nevertheless, removing only works for lossyear data and forestcover data. The emissions somehow return and error if I remove it.

So before I go deeper into this I first wanted to ask @goergen95 if there was any particular reason why you intersect bounding boxes with the grids and not the actual AOIs... and maybe you could also give a quick comment which other functions are affected by this and whether my suggested quick-fix could solve them.

Here some example code to play around without data

# call libs
library(mapme.biodiversity)
library(tidyverse)
library(sf)
library(mapview)

dir.create("shared/johannes/mapme.biodiversity_test")
setwd("~/shared/johannes/mapme.biodiversity_test")

# load data
giz<-read_sf("../../datalake/mapme.protectedareas/input/wdpa_giz/wdpa_giz_supported_31-12-2019_2022-07-15.gpkg")

wdpa_geoms<-giz %>%
  filter(ISO3%in%c("LAO","ECU")) %>%
  st_cast(to="POLYGON")

wdpa_geoms<-giz %>%
  filter(ISO3%in%c("NAM","ECU")) %>%
  st_cast(to="POLYGON")

# initialize portfolio
wdpa_portfolio <- wdpa_geoms %>%
  init_portfolio(2000:2020,
                 cores = 1,
                 verbose = TRUE)

# download relevant data ressources and link them to the portfolio
wdpa_portfolio<-
  get_resources(wdpa_portfolio,
    resources = c("gfw_lossyear","gfw_treecover","gfw_emissions"), #"treecover2000","greenhouse",
    vers_treecover = "GFC-2020-v1.8",
    vers_lossyear = "GFC-2020-v1.8"
  )

# test download tiles
test<-read_sf("gfw_emissions/tileindex_gfw_lossyear.gpkg")
mapview(test) + mapView(wdpa_geoms)
Jo-Schie commented 2 years ago

Ok I just noticed that the quick-fix download improvement then creates issues in the subsequent processing functions in calc_indicators. Seems that the downloaded data is then not usable anymore, so the quickfix might end up being not so quick at all ;( ....

but anyways we should discuss this bounding box approach. What if somebody has a rather small portfolio that unfortunately spans over large part of the globe. Does he/she need to download the whole GFW data then...that is not user-friendly and I would argue that most probably it also slows down parts of the subsequent analysis flow considerably (e.g. the subsequent stitching and cropping part.)

goergen95 commented 2 years ago

Thanks for pointing out the issue with the dateline. I will have to dig into it to find a solution for this. Based on your screenshot above it looks like it exactly returns those tiles that do not fall into the bounding box (at least how I imagined it to work).

Concerning your question why we opted for the bounding box: its basically because of efficiency to get the tiles that intersect with a portfolio. If you imagine a portfolio consisting of thousands/millions of small areas that are located in one country or on one continent it is way more efficient to intersect with the bounding box rather than intersecting each polygon on its own. I see your point that for your second example some tiles are downloaded that are not strictly required for following calculations but that is however what we get using the bounding box approach (except for the bug with the dateline, of course).

Thus my suggestion would be to fix the dateline issue but still use the bounding box approach if it is feasible. In that way, this issue is also slightly related to #83 because we make designed the download functions in a way that it would only download additional files that to not intersect with the current portfolios extent (I will give more details over there in a minute).

Jo-Schie commented 2 years ago

Thanks for the clarifications. Well in the end then it comes down to preferences. Both approaches are inefficient in their own ways.

Maybe I will try some intersection scenarios tomorrow because AFAIK st_intersects is quite efficient compared to st_intersection. I personally think that the million AOI scenario is quite an edge case whereas global portfolios are more common.

Nevertheless, this question most probably also depends on implementation complexity since we do not have days to spend on this.

As for the dateline issue there is a function called wrap_dateline or something like this. I had also troubles with this in my hurricane data processing. I'll also have a look at that and report back.

goergen95 commented 2 years ago

This seems to be related to sf using s2 for geometric predicates on the sphere by default. I assume that the bounding box as well as the GFW grid are ill-defined assuming s2-space. When we switch it off we get the result we "expect". Not sure how to proceed here, because i am unsure what is considered "best-practice" when trying to intersect unprojected geometries.

library(sf)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(mapme.biodiversity)

# create two polygons
poly1 <- st_as_sfc("POLYGON ((-76.00 -10.00, -76.00 -10.1, -76.1 -10.1, -76.1 -10.0))") |> st_as_sf()
poly2 <- st_as_sfc("POLYGON ((101.0 20.0, 101.0 20.1, 101.1 20.1, 101.1 20.0))") |> st_as_sf()
x <- rbind(poly1, poly2)
st_crs(x) <- st_crs(4326)
(bbox <- st_bbox(x))
#>  xmin  ymin  xmax  ymax 
#> -76.1 -10.1 101.1  20.1

grid_gfc <- mapme.biodiversity:::.make_global_grid(
  xmin = -180, xmax = 170, dx = 10,
  ymin = -50, ymax = 80, dy = 10
)

(sf_use_s2())
#> [1] TRUE
sf_use_s2(TRUE)
tile_ids <- st_intersects(st_as_sfc(bbox), grid_gfc)[[1]]
tiles <- grid_gfc[tile_ids, ]
plot(tiles)
plot(st_as_sfc(bbox), border = "red", add = T, lwd = 3)


sf_use_s2(FALSE)
#> Spherical geometry (s2) switched off
tile_ids <- st_intersects(st_as_sfc(bbox), grid_gfc)[[1]]
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar
tiles <- grid_gfc[tile_ids, ]
plot(tiles)
plot(st_as_sfc(bbox), border = "red", add = T, lwd = 3)

Created on 2022-08-11 by the reprex package (v2.0.1)

Jo-Schie commented 2 years ago

I tested st_intersects performance. 51 seconds for 1 Mio AOIs with simple geometries and 21 seconds for ~20.000 AOIs with very complex geometries and the grid from the function. From my point of view that is totally acceptable for these edge cases given that afterwards users save a lot of time with unnecessary downloads (downloading takes between 10-20 mins for the cases above with a very good bandwidth). So why not go into the direction I suggested erasing bbox from the function...?

goergen95 commented 2 years ago

Wow, that is some impressive timing! I was not aware of that, but in that case it totally makes sense to not use the bounding box. I pushed some more changes to the feature branch and prepared a small reprex showing that it works as expected:

library(sf)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
remotes::install_github("mapme-initiative/mapme.biodiversity", ref = "improved_downloads")
#> Skipping install of 'mapme.biodiversity' from a github remote, the SHA1 (9001af73) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(mapme.biodiversity)

# create two polygons
poly1 <- st_as_sfc("POLYGON ((-76.00 -10.00, -76.00 -10.1, -76.1 -10.1, -76.1 -10.0, -76.00 -10.0))") |> st_as_sf()
poly2 <- st_as_sfc("POLYGON ((101.0 20.0, 101.0 20.1, 101.1 20.1, 101.1 20.0, 101.0 20.0))") |> st_as_sf()
x <- rbind(poly1, poly2)
st_crs(x) <- st_crs(4326)

dir.create("./test-data/tmp", recursive = TRUE)

(x |>
  init_portfolio(
    years = 2000:2020, 
    outdir = "./test-data", 
    tmpdir = "./test-data/tmp", 
    cores = 4, 
    add_resources = F) |>
  get_resources(c("gfw_treecover", "gfw_lossyear", "gfw_emissions")) |>
  calc_indicators(c("treecover_area_and_emissions")) -> y)
#> Argument 'vers_treecover' for resource 'gfw_treecover' was not specified. Setting to default value of 'GFC-2020-v1.8'.
#> Starting process to download resource 'gfw_treecover'........
#> Argument 'vers_lossyear' for resource 'gfw_lossyear' was not specified. Setting to default value of 'GFC-2020-v1.8'.
#> Starting process to download resource 'gfw_lossyear'........
#> Starting process to download resource 'gfw_emissions'........
#> Argument 'min_size' for resource 'treecover_area_and_emissions' was not specified. Setting to default value of '10'.
#> Argument 'min_cover' for resource 'treecover_area_and_emissions' was not specified. Setting to default value of '30'.
#> Simple feature collection with 2 features and 2 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -76.1 ymin: -10.1 xmax: 101.1 ymax: 20.1
#> Geodetic CRS:  WGS 84
#> # A tibble: 2 × 3
#>   assetid treecover_area_and_emissions                                         x
#>     <int> <list>                                                   <POLYGON [°]>
#> 1       1 <tibble [21 × 3]>            ((-76 -10, -76 -10.1, -76.1 -10.1, -76.1…
#> 2       2 <tibble [21 × 3]>            ((101 20, 101 20.1, 101.1 20.1, 101.1 20…

unlink("./test-data", recursive = TRUE)

Created on 2022-08-12 by the reprex package (v2.0.1)

Do you want to create a PR for this?

Jo-Schie commented 2 years ago

Tested it on Ubuntu (twice) and MacOS with the GFW data and after some system related hickups it worked on both. I'll create a pull request. Create work. Thanks @goergen95