mapme-initiative / mapme.biodiversity

Efficient analysis of spatial biodiversity datasets for global portfolios
https://mapme-initiative.github.io/mapme.biodiversity/dev
GNU General Public License v3.0

get_resources("chirps") always downloads the global dataset, even for a tiny AOI #164

Closed: fBedecarrats closed this issue 1 year ago

fBedecarrats commented 1 year ago

Reprex:

library(tidyverse)
library(geodata)
library(sf)
library(mapme.biodiversity)
library(tmap)

# download GADM level-4 administrative boundaries for Madagascar
communes_mada <- gadm("MDG", level = 4, path = tempdir()) %>%
  st_as_sf()

# keep a single commune as the AOI
androy <- communes_mada %>%
  filter(NAME_4 == "Androy")

# i.e. this is a very tiny area
tmap_mode("view")
tm_shape(androy) +
  tm_polygons()

aoi <- init_portfolio(androy, years = 2000:2021) %>%
  get_resources("chirps")

Current progress:

> aoi <- init_portfolio(androy, years = 2000:2021) %>%
+   get_resources("chirps")
Starting process to download resource 'chirps'........
  |++                                                | 3 % ~10h 29m 56s  

As mentioned in the documentation, the package downloads monthly datasets. But these are 24 MB each.


From this source, some lighter regional datasets are available for Africa and Indonesia, but not in COG format (TIFs, PNGs, and BILs).

Does anyone have ideas about another spatially filterable source for CHIRPS data?
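
In the meantime, a workaround could be to fetch a single regional file manually and crop it to the AOI. Below is a minimal sketch that reuses androy from the reprex; the URL pattern (africa_monthly/tifs/chirps-v2.0.YYYY.MM.tif.gz) is an assumption about the CHC directory layout and should be verified before use.

library(terra)

# assumed URL pattern for the regional monthly archive; verify it first
url <- paste0(
  "https://data.chc.ucsb.edu/products/CHIRPS-2.0/",
  "africa_monthly/tifs/chirps-v2.0.2020.01.tif.gz"
)
gz <- file.path(tempdir(), basename(url))
download.file(url, gz, mode = "wb")
tif <- R.utils::gunzip(gz, remove = FALSE)  # requires the R.utils package

# read the single month and crop it to the tiny AOI
chirps_jan2020 <- rast(as.character(tif))
androy_precip <- crop(chirps_jan2020, vect(androy))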

goergen95 commented 1 year ago

25 MB is not that large, so the decision was made to download the whole file ~(though only for the time period requested)~. What is the issue? I see that the ETA is quite high, which indicates that either the source server is not very responsive or your internet connection is really slow.

fBedecarrats commented 1 year ago

The issue is that there are more than 12,096 files to download (i.e. one per month since 1981) and that downloading them takes more than 10 hours even with a good connection. The total size is ~3 GB. This is not excessive for long-term global or cross-regional analyses, but it might be prohibitive for localized analyses, in particular when analysts reside in countries with challenging internet connections.

goergen95 commented 1 year ago

I only see 489 files, but point taken. I see this as related to the discussion about moving the package to a "cloud-native" solution. The current approach is that you have to download all resources locally. If we relied only on cloud-native data formats (e.g. COGs, GeoArrow, and the like), we really could query only the required data (though it might be prohibitive for many polygons). The idea behind downloading the global layer was that you would do it once and share it between projects, even if the individual projects are very localized.
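
For context, a cloud-native access pattern could look roughly like this: terra (through GDAL's /vsicurl/ mechanism) opens remote COGs lazily, so crop() only fetches the tiles that intersect the AOI. The URL below is a placeholder, not an existing CHIRPS endpoint.

library(terra)

# placeholder URL; no such CHIRPS COG endpoint is implied to exist
cog_url <- "/vsicurl/https://example.com/chirps/chirps-v2.0.2020.01.cog.tif"

r <- rast(cog_url)                      # lazy open, reads only metadata
androy_precip <- crop(r, vect(androy))  # fetches just the overlapping tiles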

fBedecarrats commented 1 year ago

> 25 MB is not that large, so the decision was made to download the whole file (though only for the time period requested). What is the issue? I see that the ETA is quite high, which indicates that either the source server is not very responsive or your internet connection is really slow.

As mentioned in the documentation, the package automatically downloads CHIRPS data from January 1981 onwards:

> The data can be used to retrieve information on the amount of rainfall. Due to the availability of +30 years, anomaly detection and long-term average analysis is also possible. The routine will download the complete archive in order to support long-term average and anomaly calculations with respect to the 1981 - 2010 climate normal period. Thus no additional arguments need to be specified.

This is the behaviour we see in the reprex above. The portfolio has been set with years of interest from 2000 to 2021, but the data is downloaded from 1981 onwards.

goergen95 commented 1 year ago

> This is the behaviour we see in the reprex above. The portfolio has been set with years of interest from 2000 to 2021, but the data is downloaded from 1981 onwards.

You are right. That is because we calculate precipitation anomalies down the line, so we need a 30-year climate-normal period.
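
To illustrate why the full archive matters, here is a sketch of the general anomaly logic (not the package's internal code), assuming chirps is a SpatRaster stack of monthly layers with time() set:

library(terra)

months    <- format(time(chirps), "%m")
in_normal <- time(chirps) >= as.Date("1981-01-01") &
             time(chirps) <= as.Date("2010-12-31")

# long-term January mean over the 1981-2010 climate-normal period
jan_normal <- mean(chirps[[which(in_normal & months == "01")]])

# anomaly for January 2020: observed rainfall minus the climatology
jan_2020    <- chirps[[which(format(time(chirps), "%Y-%m") == "2020-01")]]
jan_anomaly <- jan_2020 - jan_normal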

fBedecarrats commented 1 year ago

> I see this as related to the discussion about moving the package to a "cloud-native" solution.

Since I realized that the TMF dataset was not a relevant candidate for my work (too little moist forest in Madagascar), I would like to prioritize this aspect. It would be nice to start by specifying an overarching understanding of the "cloud-native" solutions to be developed, so that I'm not tempted to work on something that only runs on the platform I use (MinIO, an open-source implementation of the Amazon S3 API). I'll move this item to a dedicated discussion (#143).
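
For what it's worth, GDAL's /vsis3/ driver already abstracts over S3-compatible backends, so code written against it should not be MinIO-specific. A sketch with hypothetical endpoint, credentials, and bucket names:

library(terra)

# point GDAL at an S3-compatible endpoint such as a MinIO server;
# all names and credentials below are hypothetical
Sys.setenv(
  AWS_S3_ENDPOINT       = "minio.example.org",
  AWS_ACCESS_KEY_ID     = "my-key",
  AWS_SECRET_ACCESS_KEY = "my-secret",
  AWS_VIRTUAL_HOSTING   = "FALSE"  # path-style addressing, typical for MinIO
)

r <- rast("/vsis3/my-bucket/chirps/chirps-v2.0.2020.01.tif")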