opengeospatial / geopackage

An asciidoc version of the GeoPackage specification for easier collaboration
https://www.geopackage.org
Other
264 stars 71 forks source link

My proposal for version 2 #335

Open benstadin opened 7 years ago

benstadin commented 7 years ago

Hi,

I'm about to convert our proprietary format used for our 3D map rendering engine into something interoperable. So that the data can be used to by the renderer, but also edited and visualized with different tools.

Thus I considered to base on GeoPackage. Though I need to define an extension for our projection-independent vector data tile format ("Heidelberg Sinusoidal Grid", HSG). For interoperability, I want to export a normal geopackage geometry table along our custom cells table (we call our tiles "cells" for a reason). In other words, the "cells" / tile tables are used for rendering while the normal geometry table will be used for interoperability. But I have some other doubts now that I digg deeper into the spec that Geopackage is the right choice for now.

Some areas which I think need be improved, although I realize it would break compatibility (thus probably unlikely it will ever be changed):

1. GeoPackage binary header

I understand the rationale to have some additonal geometry info, e.g. for indexing. But the GeoPackageBinaryHeader as it is is wasting lot's of space. Consider lots of of location coordinates stored as simple points, such as we get from analytics data where we analyze user flows. The GeoPackageBinaryHeader as it is now seems a pain to me for several reasons:

a) It wastes disk space: The "magic", "version", and "flags", and "srs_id" info should be stored once in the gpkg_geometry_columns table as info rather than for each individual object.

b) The srs_id info in the header is redundant. It is already defined in gpkg_contents.

c) Geopackage claims to base itself on available standards. Yet, the envelope which could equally be defined as a standard WKB or else is defines as an own array of coordinates, with another redundant info about endianess and own defintion of the envelope with flags what could equally be a WKB polygon.

d) Since the srs_id is defined in gpkg_contents it should apply to all geometries of the specified table. Again, having the srs_id in the header is redundant, and furthermore introduces potential errors: For example, the srs_id has to be changed in three places (gpkg_contents, gpkg_geometry_columns and individual geometries) and the behavior for geometries having a different srs_id than defined in the columns of contents table seems undefined. These could be interpreted either as erroneous entries which don't adhere to the specified srs_id, or as entries which have been added lately but weren't yet projected to the new srs_id, or it could be they are more "specific" e.g. overwriting the "default" srs_id with another (e.g. for geometries in a nearby UTM zone where you want certain geometries enforce a srs_id other than the "default"). Whatever is right here - it's overcomplicating implementation and the spec.

2. Multiple srs_id definitions

I've already described the srs_id issue above within the geometry scope. The srs_id is defined in three places in total:

As already described above in 1. it is not well designed this way. It's defined at multiple places, whereas the scope of validity of a srs_id defined should be the table scope.

But there is another thing to srs_id:

Missing elevation model

The specification says this about z values:

"The unit of measure for optional elevation(Z) values is determined by the CRS of the geometry; it is as-defined by a 3D CRS, and undefined for a 2D CRS."

The problem is that it in practice a combination of an elevation model and a CRS is required, because not all CRS define the elevation model. There is EPSG 4979, which is the 3D version of EPSG 4326. But many real world 3D applications and data require you to choose an elevation model. Having some UTM data with z-values, where the elevation is to be determined, is perfectly valid.

For an example, in the OpenStreetMap database z values can bes stored using WGS84 longitide / latitude and wgs84 ellipsoidal height, which they say is the most accurate combination [1].

But as you can imagine, the vast majority of data you get your hand on when mapping for example 3D building models has no wgs84 height info (also for the vast majority of OSM data elevation is not available).

Therefore it should be possible to have data with z values for WGS84 or UTM, and be able to for for example deine an elevation model as "local" or "wgs84". For rendering indoor buildings we use for example usually a UTM zone with "local" elevation [2, jump to 1:36m in the video]. Some way or another you just need to have a hint about what elevation model is with your data.

Proposal

I think the chances are very low that my proposal will be considered, since it would break compatibility somewhat. Nevertheless:

P1) It seems everybody is ok these days that there is only one geometry column allowed per table. It Seems totally acceptable, and it seems to me also that having multiple geometry columns would create more complexity. Therefore: Merge the definition of gpkg_geometry_columns with gpkg_contens, and remove gpkg_geometry_columns.

P2) Let gpkg_contens be the only place where the srs_id to be used by a table containing geometries is defined. Should be ok now, since we also stick to the one-geometry-per-table rule and got rid of gpkg_geometry_columns. Thus, let's also remove srs_id from gpkg_geometry_columns as well as from the geometry header (actually, I'd like to remove the geometry header completely, but let this be a seperate proposal). Also, remove srs_id from gpkg_contents and add srs_ref_id instead as a reference to the defintions in gpkg_spatial_ref_sys. See next point for rationale.

P3) Add a column "elevation_model" to gpkg_spatial_ref_sys with possible values of, just for example, "wgs84" and "local", and define that if it is null the elevation mode of the crs should be used (as I said, that often is not defined for many CRS). Also add an srs_ref_id column AUTOINC primary key. Let the gpkg_contens table refer to this srs_ref_id, and remove the srs_id from gpkg_contens. This is required because as described earlier there must be a combination possible to store elevation_model together with srs_id, so the content table refers to a row with all params instead if defining the srs_id directly. Also, this allows to use other codes than EPSG, and allows user-defined projection strings per table.

P4) Add the version info as it is currently defined in GeoPackageBinaryHeader to the gpkg_contens table for the respective geometry table. Also add a geometry_binary_type to gpkg_contens with the same definition as-is defined for GeoPackageBinary within the header. Also add a field envelope_column to gpkg_contens. This column will refer to the optional envelope be used for indexing (see below).

P5) Finally, we can get rid of the geometry wrapper StandardGeoPackageBinary: Let the geometry column of a feature table be a plain vanilla WKB binary (or more precisely one of the types defined in geometry_binary_type of gpkg_contents). Add a new but optional column "envelope" to the feature table as a replacement for the envelope within the header - let this also be a normal WKB binary. The envelope column is optional; if defined it must also be referred to in gpkg_contens.envelope_column.

P6) Write a small translation utility to convert between tables defined in the new "Geopackage v2" layout and the current 1.2 spec to ease applications the transition. Maybe this could be a SQL-script only solution.

I think this will be a much cleaner spec, and allow even more simple adoption of GeoPackage. For example Postgis does not yet have (or unknown to me) support to create a GeoPackage binary - but there will be no more need to. All that's required for an application to support GeoPackage "v2" is support for WKB or WKT.

I understand that this proposal will likely never get through. But to be honest, some of the points outlined above are so painful to us that I'd rather create a fork of the spec, base our software on it, and write a translation tool between this format and the original spec. So to keep us close to GPKG and ease translation and interoperability, and hope a future draft will improve on several of the points I mentioned.

As an additional info: The performance thing regarding rtree indexing is a non-issue. It's perfectly possible to use the envelope as I described above, and if really required to do something fancy things with SQLite like add auxilary data to the prepared statements or use an extension to hook into the table update. Having a wrapper object is not required at all from an implementation perspective if done right.

Best regards, Ben

[1] http://wiki.openstreetmap.org/wiki/Altitude [2] http://www.deep-map.com

pepijnve commented 7 years ago

@benstadin You're 100% correct that the header contains redundant information. This was not added without reason though. It's a consequence of limitations imposed by SQLite. User defined functions in SQLite only receive the parameters they're operating on as input (see http://sqlite.org/c3ref/create_function.html). Most spatial functions only receive one or two geometry blobs as input. Since a user defined function does not know which table and column the geometry blob comes from it cannot query gpkg_contents to get the srid. As a consequence geometry blobs need to be standalone objects in order to be able to implement useful spatial functions in SQLite.

Standard WKB is not an extensible encoding. There's no opaque 'user defined data' section in WKB geometries. I don't think there's a way we could have encoded the information that's in the geopackage geometry header in WKB. The closest thing I know of is PostGIS' EWKB, but that's not standard WKB as defined by OGC or ISO.

jyutzler commented 7 years ago

Please note that since this is a breaking change, it will not be considered until we wrap up 1.2.

benstadin commented 7 years ago

Thanks for the quick replies. I've created a "proof of concept", or it should be called a hack rather: https://github.com/benstadin/gpkg2poc

Just for illustration purpose of my proposal, I've created an extension function and added gpkg_version to the gpkg_contents table, and added the envelope column to the sample feature table.

The trick is that from the triggers (I added only "AFTER INSERT" for illustration) an extension function GPKGInsertTrigger(tableName, action, envelope) is called. So this passes the table name. Any related / required info (srs_id and gpkg_version) for this table is fetched from gpkg_contents on the fly.

The extension function caches these info so subsequent triggers for the same table will use this info. The actual rtree update should happen within this function, but was left away just because it should be obvious what's to be done there and wasn't possible to illustrate without adding boilerplate code.

In a concrete implementation there should be triggers to the gpkg_contents table also, so that the cache gets also updated whenever this table gets updated.

Notice that GPKGInsertTrigger() would not even have to care about whether it was passed the envelope or the whole object if those were of the same kind (e.g. WKB).

The only performance "bottleneck" (not really, though) in this simple implementation is that the lookup uses a string hash for the table info lookup. This could be easily optimized if the trigger SQL was created automatically, so an int ID value for the table could be used instead of the table name (as a side note, in practice unordered_map behaves as an array index in common implementations if the int values for a key are in range 0...50k, so using it should perfectly fine - it won't even use a hash function until then).

Sample output of the program:

Content info not yet cached, fetching info from gpkg_contents
Feature table changed: table = sample_features, action = INSERT, envelope = envelopeBinary1
srs_id = 4711, gpkg_version = 2
envelope = 'envelopeBinary1'

Reusing cached content info
Feature table changed: table = sample_features, action = INSERT, envelope = envelopeBinary2
srs_id = 4711, gpkg_version = 2
envelope = 'envelopeBinary2'
Program ended with exit code: 0
pepijnve commented 7 years ago

@benstadin apologies that I keep reiterating this, but I think it's kind of critical. As a proof-of-concept, try implementing ST_SRID and make it return the correct value or ST_Interacts where you fake the intersection logic, but do already detect SRID mismatch between the two geometries. (Links are to PostGIS docs, but these are actually defined in the ISO SQL specs).

Perhaps this is less important than I think it is, but one of the design goals I kept in mind back when I was working on the spec was to make it possible to implement all those spatial functions with the same signatures as they have in the ISO specs.