elipousson opened this issue 2 years ago
I think it also may be possible to read the schema information from a GeoPackage file by using the `options` parameter for `sf::st_read()`. See https://github.com/r-spatial/sf/issues/1157 for more details.
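Something like this might work (an untested sketch; `example.gpkg` is a placeholder, and I'm assuming GDAL passes the query through to SQLite so the extension tables are reachable):

```r
library(sf)
# Untested sketch: read a schema extension table through GDAL instead of
# opening the file as a database. "example.gpkg" is a placeholder name.
data_columns <- st_read(
  "example.gpkg",
  query = "SELECT * FROM gpkg_data_columns"
)
```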
Thanks for this interesting idea! It will first need some exploration of the different possible options for how best to approach this, as you point out. I've no experience with the schema extension and won't be able to look at it soon, but will be glad to do so. It's on my todo list, but of course you're welcome to take this further already or add a pull request.
I don't have a GeoPackage file that uses the Schema or Metadata extensions to test this with, but I've got a working prototype for reading the relevant tables using the RSQLite package:
Update: I developed this a little more fully and stuck the code in this gist. Writing the data may be easier than I expected, although it will take some work to incorporate validation for the schema and metadata specifications. I'm assuming there is some way to do this with `gdal_utils()` and `st_write()`, which may be preferable, but I saw this package also used RSQLite so figured I wasn't too far off track.
```r
# Prototype: read GeoPackage extension tables with RSQLite.

read_gpkg_schema <- function(dsn, ...) {
  read_gpkg_extension(
    dsn,
    "gpkg_schema",
    c("gpkg_data_columns", "gpkg_data_column_constraints"),
    ...
  )
}

read_gpkg_metadata <- function(dsn, ...) {
  read_gpkg_extension(
    dsn,
    "gpkg_metadata",
    c("gpkg_metadata", "gpkg_metadata_reference"),
    ...
  )
}

read_gpkg_extension <- function(dsn, extension, table_name = NULL, ...) {
  conn <- RSQLite::dbConnect(RSQLite::SQLite(), dsn)
  on.exit(RSQLite::dbDisconnect(conn))
  check_gpkg_extension(conn, extension, table_name)
  purrr::map(
    table_name,
    # pass .x by name: the second positional argument is `dsn`, not `table_name`
    ~ read_gpkg_table(conn = conn, table_name = .x)
  )
}

read_gpkg_table <- function(conn = NULL,
                            dsn = NULL,
                            table_name = NULL) {
  # exactly one of `conn` or `dsn` must be supplied
  rlang::check_exclusive(conn, dsn)
  if (is.null(conn)) {
    conn <- RSQLite::dbConnect(RSQLite::SQLite(), dsn)
    on.exit(RSQLite::dbDisconnect(conn))
  }
  if (is.null(table_name)) {
    cli::cli_abort(
      "{.arg table_name} must be provided."
    )
  }
  cli::cli_inform("Accessing the {.val {table_name}} table.")
  RSQLite::dbReadTable(conn, table_name)
}

check_gpkg_extension <- function(conn,
                                 extension = NULL,
                                 table_name = NULL) {
  # gpkg_extensions is the registry of extensions used by a GeoPackage
  gpkg_extensions <- RSQLite::dbReadTable(conn, "gpkg_extensions")
  extension_tables <-
    gpkg_extensions[gpkg_extensions$extension_name %in% extension, ]
  has_tables <- table_name %in% extension_tables$table_name
  if (all(has_tables)) {
    return(invisible(NULL))
  }
  cli::cli_abort(
    "{.arg table_name} {.val {table_name[!has_tables]}} can't be found for extension {.val {extension}}."
  )
}
```
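For example, on a file that registers the schema extension, usage would look like this (a sketch; the file name is a placeholder):

```r
# Hypothetical usage, assuming example.gpkg registers the schema extension:
schema_tables <- read_gpkg_schema("example.gpkg")
schema_tables[[1]] # gpkg_data_columns as a data.frame
schema_tables[[2]] # gpkg_data_column_constraints as a data.frame
```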
For really no good reason, I spent a bunch more time on my fork tonight, both expanding on the examples I shared in my comment this afternoon and refactoring the original functions.
If you're OK with expanding the Imports to include rlang, glue, and cli and moving RSQLite from Suggests to Imports, I'd be glad to open a pull request (or open a separate, more general issue for this proposed refactoring). I'd also love to see the `verbose` parameter used in the timestamp functions replaced with a `quiet` parameter for consistency with sf.
I haven't fully figured out how to use parameterized SQL queries but, once I do, I think implementing both the registered extensions (which include the metadata and schema extensions) and the broader list of community extensions is totally doable.
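For what it's worth, with DBI/RSQLite a parameterized query is just a `?` placeholder plus `params`; a minimal sketch, reusing a `conn` like the one opened above:

```r
# Sketch: parameterized lookup in gpkg_extensions; `conn` is an open
# RSQLite connection as in the functions above.
DBI::dbGetQuery(
  conn,
  "SELECT * FROM gpkg_extensions WHERE extension_name = ?",
  params = list("gpkg_schema")
)
```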
Ah – I did just spot the note in the README that rgeopackage is "not tied to any particular package by design." Let me know if you're set on this – if so, I may continue development on the fork but rename it to avoid any confusion – or if there is any flexibility.
Thanks!
> not tied to any particular package by design
This essentially means that it is not necessarily built on top of one specific framework; hence some heterogeneity between functions is to be expected. No problem there for flexibility!
I will look further into your ideas and come back! Regarding adding dependencies (Imports), I think it depends on how frequently functions of these packages are used inside and across the new functions.
> I'd also love to see the `verbose` parameter used in the timestamp functions replaced with a `quiet` parameter for consistency with sf
Good point; feel free to change in a PR!
@elipousson I've read your gist. Such awesome contributions; this rocks!! They would make the currently modest, tiny package (just toying with timestamps, which I needed once) more reflect its name, i.e. providing various useful functions to interact with GeoPackages. Also I've been learning a bit further about GPKG extensions, thanks for the links :+1:.
As a historical note, rgeopackage was created after discussion in https://github.com/rkrug/rGeoPackage/issues/9. (At the time of writing, its Readme refers to an older repo location of the current rgeopackage; I have made a pull request for that.)
I haven't yet studied the new code in your fork, so I'm still lagging behind :see_no_evil:. But I've taken a quick peek; it looks impressive and it's clear to me you take this very seriously :heart:.
I must add I'm very busy these days (months, actually) with non-coding work, which makes it challenging to keep pace with all the exciting developments. Also, at the moment rgeopackage isn't my first priority when it comes to maintenance or adding new features. Given this, and your obvious and most welcome ambition to take the package to 'the next level' and perhaps add more ideas in the future, I'd be glad to pass maintainership to you if you're willing, and switch to 'assistance mode' where I can. Which doesn't mean I won't look in detail at your upcoming pull requests, on the contrary :smile:. Just please don't expect me to handle them very quickly; I hope I can get to it (or to further inspecting your fork, if not yet a PR) in the course of next week.
Looking forward to your ideas on this!
> I'm assuming there is some way to do this with `gdal_utils()` and `st_write()` which may be preferable

Did you take a further look at those? It appears that GDAL supports the Schema and Metadata extensions, see https://gdal.org/drivers/vector/gpkg.html#level-of-support-of-geopackage-extensions. It is best to check what can be done with sf or not in this area, and document overlapping functionality, so that users are aware (similar to the case of `preset_timestamp()`).
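A quick way to check what sf/GDAL already expose for a given file (the file name is a placeholder):

```r
# Sketch: inspect what GDAL reports for a GeoPackage through sf.
sf::sf_extSoftVersion()[["GDAL"]] # GDAL version sf was built against
sf::st_layers("example.gpkg")     # layers/tables GDAL exposes for the file
```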
Thanks for the kind words about my code, @florisvdh, and the invitation to share in the development and stewardship of your package.
I'm going to try to get some additional read and write functionality working on my fork and add more tests before I open a pull request, so it may be a week or two more before I have something to review. I mainly want to make sure I don't run into any issues that require significant restructuring (e.g. switching to use GDAL via sf rather than using RSQLite), and I think adding the tests per #5 should make code review easier. I've been doing package development for about two years but mostly working with APIs and sf objects, so it is fun and different to learn how to access a GeoPackage file like a database!
OK, I'll see it when you get there. Feel free to add a branch and work here, but indeed the fork is still easier for experimenting with GH Action workflows that depend on the `main` branch. Oh yeah, `master` should better be renamed to `main`. Shall I do that already?
@florisvdh Renaming the branch sounds good to me! I haven't done this before but I'm assuming there is a way to make a pull request to bring my changes back into a development branch? Then bring them back into main one piece at a time so that I can add tests as I go?
Hi @elipousson, `main` has now replaced `master`. In case you still had a `master` branch locally that tracks `origin/master` (with `origin` referring to r-spatial), you can run `git fetch origin --prune`, rename it to `main` (while checked out on `master`) with `git branch -m main`, and update its tracking relationship with `git branch -u origin/main`. See e.g. the output of `git branch -vva`:
```
$ git branch -vva
* main                                         66ea0fe [origin/main] Readme: update installation instruction
  remotes/origin/2022-11-15_development-branch 9f23f02 Merge pull request #9 from elipousson/master
  remotes/origin/main                          66ea0fe Readme: update installation instruction
```
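Collected as one copy-pasteable sequence (assuming `origin` points at r-spatial and `master` is still checked out):

```
git fetch origin --prune
git branch -m main
git branch -u origin/main
```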
Regarding the pull request strategy: since all work (different features) sits in one branch, there are indeed two options:

1. Make one big PR for the whole dev branch at once. If you want some changes in `main` beforehand (e.g. to get GH actions in place), you'll need to do that in commits that belong to another branch that directly branches off `main`.
2. Make several successive PRs into `main`. In order not to make things complex (e.g. using `git rebase` to rearrange commits in branches directly based on `main`, unless you're comfortable with that), this will require each PR to have `main` (or one of its parent commits) as direct parent of the first commit you want to be part of the PR. So the first PR would be the first X commits of the dev branch. To make such a PR, you need to define an extra branch name at the Xth commit; I can post the git commands if needed (a sketch follows below) or define and push such a topic branch if you wish.

The first approach will be the easiest one, but I'm open to both, depending on your preference.
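A minimal sketch of the commands meant for the second option (`<sha-of-Xth-commit>` and the branch name `pr-1` are placeholders):

```
# create a topic branch at the Xth commit of the dev branch and push it,
# so it can serve as the base of the first PR
git branch pr-1 <sha-of-Xth-commit>
git push -u origin pr-1
```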
The first option definitely sounds easier to me as well. I'll see about pulling in the tests through a separate branch from `main` so we can take advantage of the GitHub-action-driven checks when I have a more full-featured pull request ready.
I've been a little slow to dig back into development on this because I've been thinking it may be better to use GDAL directly via `sf::gdal_utils()`, `gdalUtilities::ogr2ogr()` (which is a wrapper for `sf::gdal_utils()`), or some combination of functions from {vapour} (although it is a read-only version of the API), rather than trying to work directly with the GeoPackage tables via {RSQLite}. I wish it were otherwise, because it means I likely won't be able to use much of the code from my fork, but the prospect of fully implementing the validation standards described in the OGC GeoPackage standard is really more than I can handle. Any thoughts on this change of approach?
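For reference, the kind of GDAL-backed call I have in mind would look something like this (a sketch; file and layer names are placeholders):

```r
# Sketch: copy a layer between GeoPackages with GDAL's ogr2ogr, via sf.
sf::gdal_utils(
  util = "vectortranslate",
  source = "source.gpkg",
  destination = "dest.gpkg",
  options = c("-nln", "new_layer_name")
)
```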
Ah – on reflection, it may not need to be an either-or proposition. I set up a new development branch and pulled in some of the refactoring from my earlier fork/development branch (which uses {RSQLite}) but also set up a function that uses `sf::st_read()` to read the extension tables. I'm also looking at the sample SQL queries from the GeoPackage specification as a way of handling validation when adding a new extension to a GeoPackage file.
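A minimal sketch of that helper (the function name is mine, not part of the package):

```r
# Sketch: read any GeoPackage extension table via GDAL/sf instead of RSQLite.
read_gpkg_table_sf <- function(dsn, table_name) {
  sf::st_read(
    dsn,
    query = paste0("SELECT * FROM ", table_name),
    quiet = TRUE
  )
}

# e.g. read_gpkg_table_sf("example.gpkg", "gpkg_metadata")
```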
I'm going to focus on read and write functions for the schema and the metadata extensions and will share an update when I have working examples. It probably won't be a quick project but I think I'm starting to get my head around the options.
Interesting @elipousson! It's indeed best to use GDAL as the backend in R where it already covers what you want, to avoid duplication of effort and because it will be maintained to support future GPKG versions. Actually I wasn't yet aware of this being a change of approach; I had postponed reflection to the PR reviewing process :see_no_evil:.
GDAL support for aspatial tables in a GeoPackage has been implemented in sf, both reading and writing (https://github.com/r-spatial/sf/pull/1396). You may want to make use of it in programming the features that you want to add here. Also, SQL queries can be passed to GDAL with `sf::st_read()` IIRC.
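A minimal sketch of what that enables (table contents and names are placeholders):

```r
# Sketch: write and read back a non-spatial (aspatial) table in a GeoPackage
# via sf, as supported since r-spatial/sf#1396. Names are placeholders.
df <- data.frame(column_name = "ref", title = "Reference number")
sf::st_write(df, "example.gpkg", layer = "field_info")
sf::st_read("example.gpkg", layer = "field_info")
```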
I'm not sure if this is already possible using the `layer_options` and `dataset_options` for `sf::st_write()`, but I'm curious if you've explored the possibility of adding support for the GeoPackage schema extension, ideally allowing column labels or a data frame with definitions/aliases to match column names to serve as the data to use for the schema. Happy to spend some time figuring this out if you have any pointers on how to approach it!