njtierney / geotargets

Targets extensions for geospatial data
https://njtierney.github.io/geotargets/
Other
49 stars 4 forks source link

`format_shapefile()` `utils::zip()` usage #5

Closed brownag closed 3 months ago

brownag commented 4 months ago

An alternative to utils::zip() may be to write shapefiles as .shz/.shp.zip extension and readily use seek-optimized ZIP

From https://gdal.org/user/virtual_file_systems.html#sozip-seek-optimized-zip

GDAL (>= 3.7) has full read and write support for .zip files following the SOZip (Seek-Optimized ZIP) profile.

The ESRI Shapefile / DBF and GPKG -- GeoPackage vector drivers can directly generate SOZip-enabled .shz/.shp.zip or .gpkg.zip files.

While this would allow for us to easily make use of seek-optimize ZIP, I think it still does not get us around the need to have the target object name end in ".shz" or ".zip" so that GDAL can detect the file on read. https://github.com/njtierney/geotargets/issues/4#issuecomment-1975587441

brownag commented 4 months ago

A reprex of how this can be implemented. Indeed there is no need to have the target end in .zip, nor for the calls to list.files()/utils::zip() https://github.com/njtierney/geotargets/blob/69a08fef5f82c8cb5bac1746ca163065ad84ade4/R/tar-shapefile.R#L20-L29

1) The target is written as a .shz file, then renamed to lose the extension (targets checks that there is a file with same name as R object) 2) The target is read using the curly braces notation, allowing GDAL to recognize the archive even though it lacks the .shz/.zip extension

targets::tar_script({
    library(terra)
    library(targets)

    lux_area <- function(projection = "EPSG:4326") {
        terra::project(terra::vect(system.file("ex", "lux.shp", package = "terra")),
                       projection)
    }

    format_shapefile_zip <- tar_format(
        read = function(path) terra::vect(paste0("/vsizip/{", path, "}")),
        write = function(object, path) {
            terra::writeVector(
                x = object,
                filename = paste0(path, ".shz"),
                filetype = "ESRI Shapefile"
            )
            file.rename(paste0(path, ".shz"), path)
        },
        marshal = function(object) terra::wrap(object),
        unmarshal = function(object) terra::unwrap(object)
    )

    tar_target(
        lux_area1,
        command = lux_area(),
        format = format_shapefile_zip
    )
})
targets::tar_make()
#> terra 1.7.73
#> ▶ dispatched target lux_area1
#> ● completed target lux_area1 [0.037 seconds]
#> ▶ completed pipeline [0.131 seconds]
targets::tar_load("lux_area1")
lux_area1
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : lux_area1} (lux_area1)
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664
njtierney commented 4 months ago

That looks great to me, thanks @brownag !

How would one implement getting a shapefile via gadm?

Currently I've got a function written like so:

get_gadm_country <- function(country = "Australia") {
  dir.create("data/shapefiles", recursive = TRUE, showWarnings = FALSE)
  gadm(
    country = country,
    level = 0,
    path = "data/shapefiles",
    version = "4.1",
    # low res
    resolution = 2
  )
}

But this feels against the suggested workflow in that it requires creating a directory and specifying a path?

brownag commented 4 months ago

So, I think that in the case of get_gadm_country() you will get one .rds file in the directory path for each country specified. Each .rds file contains a wrapped SpatVector of a single country. When the cache exists for a particular country/version/resolution, rather than calling download.file() the SpatVector is read from file and unwrapped. Each of these could be "file" targets if you can determine what a particular call to gadm() should produce (i.e. 1 file with specific file name given parameters in path for each country specified)

After https://github.com/Aariq/geotargets/commit/f82edd47ababf7b073f388e87d4d5806cc747132 you can easily do something like tar_terra_vect(some_countries, get_gadm_country(c("Australia", "New Zealand"))) which will write the SpatVector combination of two .rds files as a new seek-optimized shapefile (.shz). If the source .rds files don't exist (i.e. on first run of the pipeline) or are deleted (i.e. a temp folder is used), they would be re-downloaded. This does not rely on external files outside the target store, but technically you have copies of the data in the initial gadm() download (stored as PackedSpatVector) and the target(s) (stored as zipped shapefile)

If you want to have the file(s) produced by gadm() be the targets... we need to consider that they are not shapefiles but rather PackedSpatVector in "rds" format... and a single call could produce multiple "rds" files (multiple targets)

You might need to have one format="file" target for each country of interest, possibly using dynamic branching or using a wrapper function around gadm() so you can identify what the file names are, rather than return the unwrapped SpatVector result in memory. EDIT: format="file" can be multiple file paths or directory, so the number of targets to use would depend on what you want to do with them

In general format="file" can probably be used for a variety of cases of geospatial data, especially when the user has some specific folder structure or process they need to follow that wouldn't work well out of the target store proper. Using format="file" does require some additional bookkeeping on the paths, filenames and types, though. https://books.ropensci.org/targets/data.html#external-files

Aariq commented 3 months ago

Let's move the discussion about gadm() to a new issue (if still relevant) and close this one, yeah?

njtierney commented 3 months ago

Yes I think let's close this issue and open a new one about gadm :)

njtierney commented 3 months ago

Also I should say, thank you @brownag for your explanation above! Much appreciated, :)