njtierney / geotargets

Targets extensions for geospatial data
http://geotargets.njtierney.com/
Other
61 stars 4 forks source link

[draft] Implement SOZip storage of terra targets #62

Open brownag opened 7 months ago

brownag commented 7 months ago

This is a draft PR to implement an option (zipfile=TRUE) to write/read Seek-Optimized ZIP (SOZIP) files in the target store for #37. This works for tar_terra* SpatVector and SpatRaster methods,

Rather than attempting to create (non-SOZIP) ZIP files independently using utils::zip() or similar this PR uses one of two (simpler, but somewhat more limited) pathways available thru existing GDAL drivers.

Examples of the above cases have been added to tests.

Raster:

targets::tar_script({
    list(
        geotargets::tar_terra_rast(
            test_terra_rast2,
            terra::rast(system.file("ex/elev.tif", package = "terra")),
            gdal = c("STREAMABLE_OUTPUT=YES", "COMPRESS=NONE"),
            zipfile = TRUE
        ),
        geotargets::tar_terra_rast(
            test_terra_rast3,
            terra::rast(system.file("ex/elev.tif", package = "terra")),
            filetype = "GPKG",
            zipfile = TRUE
        )
    )
})
targets::tar_make()
#> ▶ dispatched target test_terra_rast2
#> ● completed target test_terra_rast2 [0.007 seconds]
#> ▶ dispatched target test_terra_rast3
#> ● completed target test_terra_rast3 [0.004 seconds]
#> ▶ ended pipeline [0.202 seconds]
#> Warning messages:
#> 1: /vsizip/{_targets/scratch/test_terra_rast2}/test_terra_rast2, band 1: Cannot modify metadata at that point in a streamed output file (GDAL error 6) 
#> 2: /vsizip/{_targets/scratch/test_terra_rast2}/test_terra_rast2, band 1: Cannot modify metadata at that point in a streamed output file (GDAL error 6) 
#> 3: /vsizip/{_targets/scratch/test_terra_rast2}/test_terra_rast2, band 1: Cannot modify metadata at that point in a streamed output file (GDAL error 6) 
#> 4: /vsizip/{_targets/scratch/test_terra_rast2}/test_terra_rast2, band 1: Cannot modify metadata at that point in a streamed output file (GDAL error 6)

targets::tar_read(test_terra_rast2)
#> class       : SpatRaster 
#> dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#> resolution  : 0.008333333, 0.008333333  (x, y)
#> extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source      : test_terra_rast2} 
#> name        : elevation

targets::tar_read(test_terra_rast3)
#> class       : SpatRaster 
#> dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#> resolution  : 0.008333333, 0.008333333  (x, y)
#> extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source      : test_terra_rast3} 
#> name        : Height 
#> min value   :    141 
#> max value   :    547

Vector:

targets::tar_script({
    lux_area <- function(projection = "EPSG:4326") {
        terra::project(
            terra::vect(system.file("ex", "lux.shp",
                                    package = "terra"
            )),
            projection
        )
    }
    list(
        geotargets::tar_terra_vect(
            test_terra_vect,
            lux_area()
        ),
        geotargets::tar_terra_vect(
            test_terra_vect_shz,
            lux_area(),
            filetype = "ESRI Shapefile"
        ),
        geotargets::tar_terra_vect(
            test_terra_vect_parquet_zip,
            lux_area(),
            filetype = "Parquet",
            zipfile = TRUE
        )
    )
})
targets::tar_make()
#> ▶ dispatched target test_terra_vect_parquet_zip
#> ● completed target test_terra_vect_parquet_zip [0.028 seconds]
#> ▶ dispatched target test_terra_vect
#> ● completed target test_terra_vect [0.054 seconds]
#> ▶ dispatched target test_terra_vect_shz
#> ● completed target test_terra_vect_shz [0.006 seconds]
#> ▶ ended pipeline [0.248 seconds]

targets::tar_read(test_terra_vect)
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : test_terra_vect
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

targets::tar_read(test_terra_vect_shz)
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : test_terra_vect_shz} (test_terra_vect_shz)
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

targets::tar_read(test_terra_vect_parquet_zip)
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : test_terra_vect_parquet_zip}
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

In the future, {gdalraster} (#48) could be used to assemble SOZip files--this would allow for sidecar files to be included (it appears they are not stored using /vsizip/ to write at this time, need to investigate), and drivers that do not support direct write of SOZip to be supported.

brownag commented 7 months ago

Apparently Parquet driver is not available in GDAL build being used on GHA, used FlatGeobuf instead which mostly works (appears to come back in different order; GDAL bug?)

Also apparently the snapshots have slight differences depending on runner. I don't have any snapshot changes to accept locally, so may need to relax these specific tests (as appears to have been done in other recent cases)


EDIT: GDAL >=3.5 is required for (Geo)Parquet (tests with it fail on GHA ubuntu latest), and >=3.8 for the v1.0.0 specification. https://gdal.org/drivers/vector/parquet.html

GDAL >=3.1 is required for FlatGeobuf https://gdal.org/drivers/vector/flatgeobuf.html

Aariq commented 1 month ago

For other formats, a more "generic" write and read function could be used:

write_to_zip <- function(object, path) {
    #rename path to not be confused with fs::path() just to make more readable
    out_path <- path
    #do stuff in a fresh local tempdir() that disappears when function is done
    tmp <- withr::local_tempdir()
    dir_create(tmp, fs::path_dir(out_path))
    #write the raster (hard-coded options for demonstration)
    writeRaster(object,
                fs::path(tmp, out_path),
                filetype = "GTiff",
                overwrite = TRUE)
    #figure out which files got written
    raster_files <- dir_ls(path(tmp, path_dir(out_path)))
    #package those into a zip file using `zip::zip()`
    zip::zip(
        path(tmp, fs::path_file(out_path)),
        files = fs::path_file(raster_files),
        compression_level = 1,
        root = fs::path_dir(raster_files)
    )
    #move the zip file to the out_path as expected output
    file_move(path(tmp, fs::path_file(out_path)), out_path)
}

read_from_zip <- function(path) {
    tmp <- local_tempdir()
    #extract into tempdir
    zip::unzip(zipfile = path, exdir = tmp)
    #read in as rast
    rast(fs::path(tmp, fs::path_file(path)))
}

I'm not sure how these compare to SOZip in terms of read and write overhead, but I assume they are worse.