njtierney / geotargets

Targets extensions for geospatial data
https://njtierney.github.io/geotargets/
Other
57 stars 4 forks source link

using `geodata::gadm` with geotargets #22

Open njtierney opened 8 months ago

njtierney commented 8 months ago

geodata::gadm is commonly used to get shapefiles for countries at various levels of boundaries.

This issue is to talk about how to use gadm within geotargets - related to #5.

Some code to download a country shapefile might be something like:

get_gadm_country <- function(country = "Australia") {
    dir.create("data/shapefiles", recursive = TRUE, showWarnings = FALSE)
    geodata::gadm(
        country = country,
        level = 0,
        path = "data/shapefiles",
        version = "4.1",
        # low res
        resolution = 2
    )
}

My question is how should we recommend that users implement and use something like this in a targets workflow? At this stage I haven't tested this code out with the latest version of {geotargets} so it might all just work, but my specific concerns are around:

Also porting in @brownag 's comment from #5 in here as it is relevant:

In the case of get_gadm_country() you will get one .rds file in the directory path for each country specified. Each .rds file contains a wrapped SpatVector of a single country. When the cache exists for a particular country/version/resolution, rather than calling download.file() the SpatVector is read from file and unwrapped. Each of these could be "file" targets if you can determine what a particular call to gadm() should produce (i.e. 1 file with specific file name given parameters in path for each country specified)

After https://github.com/Aariq/geotargets/commit/f82edd47ababf7b073f388e87d4d5806cc747132 you can easily do something like tar_terra_vect(some_countries, get_gadm_country(c("Australia", "New Zealand"))) which will write the SpatVector combination of two .rds files as a new seek-optimized shapefile (.shz). If the source .rds files don't exist (i.e. on first run of the pipeline) or are deleted (i.e. a temp folder is used), they would be re-downloaded. This does not rely on external files outside the target store, but technically you have copies of the data in the initial gadm() download (stored as PackedSpatVector) and the target(s) (stored as zipped shapefile)

If you want to have the file(s) produced by gadm() be the targets... we need to consider that they are not shapefiles but rather PackedSpatVector in "rds" format... and a single call could produce multiple "rds" files (multiple targets)

You might need to have one format="file" target for each country of interest, possibly using dynamic branching or using a wrapper function around gadm() so you can identify what the file names are, rather than return the unwrapped SpatVector result in memory. EDIT: format="file" can be multiple file paths or directory, so the number of targets to use would depend on what you want to do with them

In general format="file" can probably be used for a variety of cases of geospatial data, especially when the user has some specific folder structure or process they need to follow that wouldn't work well out of the target store proper. Using format="file" does require some additional bookkeeping on the paths, filenames and types, though. https://books.ropensci.org/targets/data.html#external-files

mdsumner commented 7 months ago

(Following up a discussion with @njtierney) ... it's feasible to avoid explicit download with vsicurl

library(terra)
vect("/vsicurl/https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_CRI.gpkg", "ADM_ADM_1")
 class       : SpatVector
 geometry    : polygons
 dimensions  : 7, 11  (geometries, attributes)
 extent      : -87.10185, -82.55232, 5.49857, 11.21976  (xmin, xmax, ymin, ymax)
 source      : gadm41_CRI.gpkg (ADM_ADM_1)
 coord. ref. : lon/lat WGS 84 (EPSG:4326)
 names       :   GID_1 GID_0    COUNTRY     NAME_1 VARNAME_1 NL_NAME_1
 type        :   <chr> <chr>      <chr>      <chr>     <chr>     <chr>
 values      : CRI.1_1   CRI Costa Rica   Alajuela        NA        NA
               CRI.2_1   CRI Costa Rica    Cartago        NA        NA
               CRI.3_1   CRI Costa Rica Guanacaste        NA        NA
    TYPE_1 ENGTYPE_1  CC_1 HASC_1 ISO_1
     <chr>     <chr> <chr>  <chr> <chr>
 Provincia  Province     2  CR.AL  CR-A
 Provincia  Province     3  CR.CA  CR-C
 Provincia  Province     5  CR.GU  CR-G

That might provide a better ux. There are zipped GPKG with every country in them, but not as efficient to read with a query. I might make some suggestions to geodata. But another option is with CGAZ geoboundaries:

## there are ADM1 and ADM2 levels also 
dsn <- "/vsizip//vsicurl/https://github.com/wmgeolab/geoBoundaries/raw/main/releaseData/CGAZ/geoBoundariesCGAZ_ADM0.zip"
vect(dsn, query = "SELECT * FROM geoBoundariesCGAZ_ADM0 WHERE shapeGroup IN ('AUS','NZL')")

class       : SpatVector
 geometry    : polygons
 dimensions  : 2, 3  (geometries, attributes)
 extent      : -178.9278, 179.0695, -54.76208, -8.538777  (xmin, xmax, ymin, ymax)
 source      : geoBoundariesCGAZ_ADM0.zip
 coord. ref. : lon/lat WGS 84 (EPSG:4326)
 names       : shapeGroup shapeType   shapeName
 type        :      <chr>     <chr>       <chr>
 values      :        AUS      ADM0   Australia
                      NZL      ADM0 New Zealand
njtierney commented 6 months ago

Thanks for this @mdsumner !

I think that overall from this we can address this issue in a vignette, by detailing how we would encourage people to use geotargets for things like this.

Using the vsicurl approach here means users don't have to set a specific download file. So I think we could describe the ideal approach we would like users to use, demonstrate it, and potentially point to https://github.com/njtierney/demo-geotargets as a resource for people to look at for how this can be done?

Aariq commented 6 months ago

Ok, finally getting this as I'm using gadm() now for the first time. So, should I have one target that is the URL (e.g. tar_url(CRI_url, "https://geodata.ucdavis.edu/gadm/gadm4.1/gpkg/gadm41_CRI.gpkg") and a tar_terra_vect() that does something like vect(paste0("vsicurl/", CRI_url))? It does seem like it might be nice to have a target factory for this eventually.