Closed brownag closed 8 months ago
A reprex of how this can be implemented. Indeed there is no need to have the target end in .zip, nor for the calls to list.files()
/utils::zip()
https://github.com/njtierney/geotargets/blob/69a08fef5f82c8cb5bac1746ca163065ad84ade4/R/tar-shapefile.R#L20-L29
1) The target is written as a .shz file, then renamed to lose the extension (targets checks that there is a file with same name as R object) 2) The target is read using the curly braces notation, allowing GDAL to recognize the archive even though it lacks the .shz/.zip extension
targets::tar_script({
library(terra)
library(targets)
lux_area <- function(projection = "EPSG:4326") {
terra::project(terra::vect(system.file("ex", "lux.shp", package = "terra")),
projection)
}
format_shapefile_zip <- tar_format(
read = function(path) terra::vect(paste0("/vsizip/{", path, "}")),
write = function(object, path) {
terra::writeVector(
x = object,
filename = paste0(path, ".shz"),
filetype = "ESRI Shapefile"
)
file.rename(paste0(path, ".shz"), path)
},
marshal = function(object) terra::wrap(object),
unmarshal = function(object) terra::unwrap(object)
)
tar_target(
lux_area1,
command = lux_area(),
format = format_shapefile_zip
)
})
targets::tar_make()
#> terra 1.7.73
#> ▶ dispatched target lux_area1
#> ● completed target lux_area1 [0.037 seconds]
#> ▶ completed pipeline [0.131 seconds]
targets::tar_load("lux_area1")
lux_area1
#> class : SpatVector
#> geometry : polygons
#> dimensions : 12, 6 (geometries, attributes)
#> extent : 5.74414, 6.528252, 49.44781, 50.18162 (xmin, xmax, ymin, ymax)
#> source : lux_area1} (lux_area1)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326)
#> names : ID_1 NAME_1 ID_2 NAME_2 AREA POP
#> type : <num> <chr> <num> <chr> <num> <int>
#> values : 1 Diekirch 1 Clervaux 312 18081
#> 1 Diekirch 2 Diekirch 218 32543
#> 1 Diekirch 3 Redange 259 18664
That looks great to me, thanks @brownag !
How would one implement getting a shapefile via gadm?
Currently I've got a function written like so:
get_gadm_country <- function(country = "Australia") {
dir.create("data/shapefiles", recursive = TRUE, showWarnings = FALSE)
gadm(
country = country,
level = 0,
path = "data/shapefiles",
version = "4.1",
# low res
resolution = 2
)
}
But this feels against the suggested workflow in that it requires creating a directory and specifying a path?
So, I think that in the case of get_gadm_country()
you will get one .rds file in the directory path
for each country specified. Each .rds file contains a wrapped SpatVector of a single country. When the cache exists for a particular country/version/resolution, rather than calling download.file()
the SpatVector is read from file and unwrapped. Each of these could be "file" targets if you can determine what a particular call to gadm()
should produce (i.e. 1 file with specific file name given parameters in path
for each country
specified)
After https://github.com/Aariq/geotargets/commit/f82edd47ababf7b073f388e87d4d5806cc747132 you can easily do something like tar_terra_vect(some_countries, get_gadm_country(c("Australia", "New Zealand")))
which will write the SpatVector combination of two .rds files as a new seek-optimized shapefile (.shz). If the source .rds files don't exist (i.e. on first run of the pipeline) or are deleted (i.e. a temp folder is used), they would be re-downloaded. This does not rely on external files outside the target store, but technically you have copies of the data in the initial gadm()
download (stored as PackedSpatVector) and the target(s) (stored as zipped shapefile)
If you want to have the file(s) produced by gadm()
be the targets... we need to consider that they are not shapefiles but rather PackedSpatVector in "rds" format... and a single call could produce multiple "rds" files (multiple targets)
You might need to have one format="file"
target for each country of interest, possibly using dynamic branching or using a wrapper function around gadm()
so you can identify what the file names are, rather than return the unwrapped SpatVector result in memory. EDIT: format="file"
can be multiple file paths or directory, so the number of targets to use would depend on what you want to do with them
In general format="file"
can probably be used for a variety of cases of geospatial data, especially when the user has some specific folder structure or process they need to follow that wouldn't work well out of the target store proper. Using format="file"
does require some additional bookkeeping on the paths, filenames and types, though. https://books.ropensci.org/targets/data.html#external-files
Let's move the discussion about gadm()
to a new issue (if still relevant) and close this one, yeah?
Yes I think let's close this issue and open a new one about gadm :)
Also I should say, thank you @brownag for your explanation above! Much appreciated, :)
An alternative to
utils::zip()
may be to write shapefiles as .shz/.shp.zip extension and readily use seek-optimized ZIPFrom https://gdal.org/user/virtual_file_systems.html#sozip-seek-optimized-zip
While this would allow for us to easily make use of seek-optimize ZIP, I think it still does not get us around the need to have the target object name end in ".shz" or ".zip" so that GDAL can detect the file on read.https://github.com/njtierney/geotargets/issues/4#issuecomment-1975587441