r-spatial / rgeopackage

R package with helper tools in creating or handling GeoPackage files
GNU General Public License v3.0

Feature request: helper function for working with GeoPackage schema extension #4

Open elipousson opened 2 years ago

elipousson commented 2 years ago

I'm not sure if this is already possible using the layer_options and dataset_options arguments for sf::st_write(), but I'm curious whether you've explored adding support for the GeoPackage schema extension, ideally allowing column labels or a data frame of definitions/aliases matched to column names to serve as the source data for the schema. Happy to spend some time figuring this out if you have any pointers on how to approach it!

elipousson commented 2 years ago

I think it also may be possible to read the schema information from a GeoPackage file by using the options parameter for sf::st_read(). See https://github.com/r-spatial/sf/issues/1157 for more details.
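For example, it looks like a SQL query can be passed straight through to GDAL, which would read one of the schema extension tables as a plain data frame. A minimal sketch, assuming a file example.gpkg (hypothetical name) that actually registers the schema extension and so contains the gpkg_data_columns table:

```r
library(sf)

# Pass a SQL query through GDAL to read an aspatial extension table.
# "example.gpkg" is a placeholder; the query errors if the file does
# not contain a gpkg_data_columns table.
data_columns <- st_read(
  "example.gpkg",
  query = "SELECT * FROM gpkg_data_columns"
)
```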

florisvdh commented 2 years ago

Thanks for this interesting idea! As you note, we'll first need to explore the different possible approaches. I have no experience with the schema extension and won't be able to look at it soon, but will be glad to do so. I've added it to my todo list, but of course you're welcome to take this further already or open a pull request.

elipousson commented 2 years ago

I don't have a GeoPackage file that uses the Schema or Metadata extensions to test with, but I've got a working prototype for reading the relevant tables using the RSQLite package:

Update: I developed this a little more fully and put the code in this gist. Writing the data may be easier than I expected, although it will take some work to incorporate validation against the schema and metadata specifications. I'm assuming there is some way to do this with gdal_utils() and st_write(), which may be preferable, but I saw this package also uses RSQLite so I figured I wasn't too far off track.

read_gpkg_schema <- function(dsn, ...) {
  read_gpkg_extension(
    dsn,
    "gpkg_schema",
    c("gpkg_data_columns", "gpkg_data_column_constraints"),
    ...
    )
}

read_gpkg_metadata <- function(dsn, ...) {
  read_gpkg_extension(
    dsn,
    "gpkg_metadata",
    c("gpkg_metadata", "gpkg_metadata_reference"),
    ...
  )
}

read_gpkg_extension <- function(dsn, extension, table_name = NULL, ...) {
  conn <- RSQLite::dbConnect(RSQLite::SQLite(), dsn)
  # Close the connection when the function exits
  on.exit(RSQLite::dbDisconnect(conn))

  check_gpkg_extension(conn, extension, table_name)

  purrr::map(
    table_name,
    # Name the argument: a bare .x would match the dsn parameter and
    # trip rlang::check_exclusive() in read_gpkg_table()
    ~ read_gpkg_table(conn = conn, table_name = .x)
  )
}

read_gpkg_table <- function(conn = NULL,
                            dsn = NULL,
                            table_name = NULL) {
  rlang::check_exclusive(conn, dsn)

  if (is.null(conn)) {
    conn <- RSQLite::dbConnect(RSQLite::SQLite(), dsn)
    # Only close connections opened here; callers manage their own
    on.exit(RSQLite::dbDisconnect(conn))
  }

  if (is.null(table_name)) {
    cli::cli_abort(
      "{.arg table_name} must be provided."
    )
  }

  cli::cli_inform("Accessing the {.val {table_name}} table.")
  RSQLite::dbReadTable(conn, table_name)
}

check_gpkg_extension <- function(conn,
                                 extension = NULL,
                                 table_name = NULL) {
  gpkg_extensions <- RSQLite::dbReadTable(conn, "gpkg_extensions")
  extension_tables <-
    gpkg_extensions[gpkg_extensions$extension_name %in% extension, ]

  has_tables <-
    table_name %in% extension_tables$table_name

  if (all(has_tables)) {
    return(invisible(NULL))
  }

  cli::cli_abort(
    "{.arg table_name} {.val {table_name[!has_tables]}} can't be found for extension {.val {extension}}."
    )
}
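A minimal usage sketch for the prototype above, assuming a hypothetical file example.gpkg that registers the schema extension:

```r
# Returns a list of data frames, one per requested extension table:
# gpkg_data_columns and gpkg_data_column_constraints.
# check_gpkg_extension() aborts if the tables are not registered.
schema_tables <- read_gpkg_schema("example.gpkg")
```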

elipousson commented 2 years ago

For really no good reason, I spent a bunch more time on my fork tonight, both expanding on the examples I shared in my comment this afternoon and refactoring the original functions.

If you're OK with expanding the Imports to include rlang, glue, and cli and moving RSQLite from Suggests to Imports, I'd be glad to open a pull request (or open a separate more general issue for this proposed refactoring). I'd also love to see the verbose parameter used in the timestamp functions be replaced with a quiet parameter for consistency with sf.

I haven't fully figured out how to use parameterized SQL queries but, once I do, I think implementing both the registered extensions (which includes the metadata and schema extensions) and broader list of community extensions is totally doable.
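For reference, a parameterized query with RSQLite might look like the sketch below (file name hypothetical): placeholders are bound via params rather than pasted into the SQL string, which avoids quoting and injection issues.

```r
library(RSQLite)

con <- dbConnect(SQLite(), "example.gpkg")  # hypothetical file

# The ? placeholder is bound safely from params, so the extension
# name never has to be interpolated into the SQL string itself.
ext <- dbGetQuery(
  con,
  "SELECT table_name, definition FROM gpkg_extensions
   WHERE extension_name = ?",
  params = list("gpkg_schema")
)

dbDisconnect(con)
```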

elipousson commented 2 years ago

Ah – I did just spot the note in the README that rgeopackage is "not tied to any particular package by design." Let me know if you're set on this – if so, I may continue development on the fork but rename it to avoid any confusion – or if there is any flexibility.

florisvdh commented 2 years ago

Thanks!

not tied to any particular package by design

This essentially means that it is not necessarily built on top of one specific framework; hence some heterogeneity between functions is to be expected. No problem there for flexibility!

I will look further into your ideas and come back! Regarding adding dependencies (Imports), I think it depends on how frequently functions of these packages are used inside and across the new functions.

I'd also love to see the verbose parameter used in the timestamp functions be replaced with a quiet parameter for consistency with sf

Good point; feel free to change in a PR!

florisvdh commented 2 years ago

@elipousson I've read your gist. Such awesome contributions; this rocks!! They would make the currently modest, tiny package (just toying with timestamps, which I needed once) better reflect its name, i.e. providing various useful functions to interact with GeoPackages. Also I've been learning a bit further about GPKG extensions, thanks for the links :+1:.

As a historical note, rgeopackage was created after discussion in https://github.com/rkrug/rGeoPackage/issues/9. (At the time of writing, its Readme refers to an older repo location of current rgeopackage; have made a pull request for that)

I didn't yet study your new code in your fork, so still lagging behind :see_no_evil:. But I've taken a quick peek; it looks impressive and it's clear to me you take this very seriously :heart:.

I must add I'm very busy these days (months actually) with non-coding work, which makes it challenging to keep pace with all exciting developments. Also, at the moment rgeopackage isn't my first priority when it comes to maintenance or adding new features. Given this, and your obvious and most welcome ambition to take the package to 'the next level' and perhaps add more ideas in the future, I'd be glad to pass maintainership to you if you'd be willing, and go into 'assistance mode' where I can. Which doesn't mean I won't look in detail at your upcoming pull requests, on the contrary :smile:. Just please don't expect me to handle them very quickly; I hope I can get to it (or to further inspecting your fork, if not yet a PR) in the course of next week.

Looking forward to your ideas on this!

florisvdh commented 2 years ago

I'm assuming there is some way to do this with gdal_utils() and st_write() which may be preferable

Did you take a further look at those? It appears that GDAL supports the Schema and Metadata extensions, see https://gdal.org/drivers/vector/gpkg.html#level-of-support-of-geopackage-extensions. It is best to check what can be done with sf or not in this area, and document overlapping functionality, so that users are aware (similar to the case of preset_timestamp()).
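One way to check what sf already exposes is to list the layers GDAL reports and query the extension tables directly. A hedged sketch, with example.gpkg as a placeholder file name:

```r
library(sf)

# List the layers GDAL exposes, including aspatial attribute tables.
st_layers("example.gpkg")

# The metadata extension tables can be queried like any other table;
# this errors if the file does not register the metadata extension.
meta <- st_read("example.gpkg", query = "SELECT * FROM gpkg_metadata")
```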

elipousson commented 2 years ago

Thanks for the kind words about my code, @florisvdh, and the invitation to share in the development and stewardship of your package.

I'm going to try to get some additional read and write functionality working on my fork and add more tests before I open a pull request so it may be a week or two more before I have something to review. I mainly want to make sure I don't run into any issues that require significant restructuring (e.g. switching to use GDAL via sf rather than using RSQLite) and I think adding the tests per #5 should make code review easier. I've been doing package development for about two years but mostly working with APIs and sf objects so it is fun and different to learn how to access a GeoPackage file like a database!

florisvdh commented 2 years ago

OK, I'll see it when you get there. Feel free to add a branch and work here, but indeed the fork is still easier for experimenting with GH Action workflows that depend on the main branch. Oh yeah, master should be renamed to main. Shall I do that already?

elipousson commented 2 years ago

@florisvdh Renaming the branch sounds good to me! I haven't done this before but I'm assuming there is a way to make a pull request to bring my changes back into a development branch? Then bring them back into main one piece at a time so that I can add tests as I go?

florisvdh commented 2 years ago

Hi @elipousson main has now replaced master. In case you still have a local master branch that tracks origin/master (with origin referring to r-spatial), you can run git fetch origin --prune, rename it to main (while checked out on master) with git branch -m main, and update its tracking relationship with git branch -u origin/main. See e.g. the output of git branch -vva:

$ git branch -vva
* main                                         66ea0fe [origin/main] Readme: update installation instruction
  remotes/origin/2022-11-15_development-branch 9f23f02 Merge pull request #9 from elipousson/master
  remotes/origin/main                          66ea0fe Readme: update installation instruction

florisvdh commented 2 years ago

Regarding the pull request strategy: since all work (different features) sits in one branch there are indeed two options:

The first approach will be the easiest one, but I'm open to both, depending on your preference.

elipousson commented 2 years ago

The first option definitely sounds easier to me as well. I'll see about pulling in the tests through a separate branch from main so we can take advantage of the GitHub action driven checks when I have a more full-featured pull request ready.

elipousson commented 1 year ago

I've been a little slow to dig back into development on this because I've been thinking it may be better to use GDAL directly via sf::gdal_utils(), gdalUtilities::ogr2ogr() (which is a wrapper for sf::gdal_utils()), or some combination of functions from {vapour} (although it is a read-only version of the API), rather than trying to work directly with the GeoPackage tables via {RSQLite}. I wish it were otherwise, because it means I likely won't be able to use much of the code from my fork, but the prospect of fully implementing the validation standards described in the OGC GeoPackage standard is really more than I can handle. Any thoughts on this change of approach?

elipousson commented 1 year ago

Ah – on reflection, it may not need to be an either-or proposition. I set up a new development branch and pulled some of the refactoring from my earlier fork/development branch (which uses {RSQLite}) but also set up a function that uses sf::st_read() to read the extension tables. I'm also looking at the sample SQL queries from the GeoPackage specification as a way of handling validation when adding a new extension to a GeoPackage file.

I'm going to focus on read and write functions for the schema and the metadata extensions and will share an update when I have working examples. It probably won't be a quick project but I think I'm starting to get my head around the options.

florisvdh commented 1 year ago

Interesting @elipousson! It's indeed best to use GDAL as backend in R where it already can cover what you want, to avoid duplication of effort and because it will be maintained to support future GPKG versions. Actually I wasn't yet aware of this being a change of approach, postponing reflection to the PR reviewing process :see_no_evil:.

GDAL support for aspatial tables in a GeoPackage has been implemented in sf, both reading and writing (https://github.com/r-spatial/sf/pull/1396). You may want to make use of it in programming the features that you want to add here. Also, SQL queries can be passed to GDAL with sf::st_read() IIRC.
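A quick sketch of the aspatial round trip referenced above (file and layer names hypothetical): sf writes a plain data frame as an attribute-only table in the GeoPackage and reads it back as a data frame.

```r
library(sf)

# Write a plain data.frame as an aspatial table in a GeoPackage
# (supported since r-spatial/sf#1396). Names are placeholders.
notes <- data.frame(id = 1:2, note = c("first", "second"))
st_write(notes, "example.gpkg", layer = "notes")

# Read it back; with no geometry column, st_read returns a data.frame.
notes2 <- st_read("example.gpkg", layer = "notes")
```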