opengeospatial / geopackage

An asciidoc version of the GeoPackage specification for easier collaboration
https://www.geopackage.org
Other
264 stars 71 forks source link

Contents Table Id Column #516

Open bosborn opened 4 years ago

bosborn commented 4 years ago

This is a proposal (with low breaking risk to existing GeoPackages) to update the gpkg_contents table with an integer id primary key column.

The SQLite rowid column is not a reliable foreign key:

If the rowid is not aliased by INTEGER PRIMARY KEY then it is not persistent and might change. In particular the VACUUM command will change rowids for tables that do not declare an INTEGER PRIMARY KEY. Therefore, applications should not normally access the rowid directly, but instead use an INTEGER PRIMARY KEY.

VACUUM is a common function call used to reduce a GeoPackage memory footprint after modifications. Any foreign key to the contents rowid is not a guarantee.

The Related Tables Extension requires integer ids for relationships. The NGA Contents Id Extension is a work around to this issue.

Proposal:

Add an INTEGER id primary key to the gpkg_contents table and change the table_name column to be unique.

CREATE TABLE gpkg_contents (
  id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
  table_name TEXT NOT NULL UNIQUE,
  data_type TEXT NOT NULL,
  identifier TEXT UNIQUE,
  description TEXT DEFAULT '',
  last_change DATETIME NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
  min_x DOUBLE,
  min_y DOUBLE,
  max_x DOUBLE,
  max_y DOUBLE,
  srs_id INTEGER,
  CONSTRAINT fk_gc_r_srs_id FOREIGN KEY (srs_id) REFERENCES gpkg_spatial_ref_sys(srs_id)
);

Justification:

Migration SQL:

See ALTER TABLE: Making Other Kinds Of Table Schema Changes

-- Disable foreign key constraints
PRAGMA foreign_keys=OFF;

-- Start a transaction
BEGIN TRANSACTION;

-- Create a new contents table with id
CREATE TABLE new_gpkg_contents (
  id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
  table_name TEXT NOT NULL UNIQUE,
  data_type TEXT NOT NULL,
  identifier TEXT UNIQUE,
  description TEXT DEFAULT '',
  last_change DATETIME NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
  min_x DOUBLE,
  min_y DOUBLE,
  max_x DOUBLE,
  max_y DOUBLE,
  srs_id INTEGER,
  CONSTRAINT fk_gc_r_srs_id FOREIGN KEY (srs_id) REFERENCES gpkg_spatial_ref_sys(srs_id)
);

-- View old contents table columns (optional)
PRAGMA table_info(gpkg_contents);

-- View new contents table columns (optional)
PRAGMA table_info(new_gpkg_contents);

-- Transfer contents data
INSERT INTO new_gpkg_contents(id, table_name, data_type, identifier, description, last_change, min_x, min_y, max_x, max_y, srs_id)
    SELECT rowid, table_name, data_type, identifier, description, last_change, min_x, min_y, max_x, max_y, srs_id FROM gpkg_contents;

-- View mapping between new contents table and old contents table (optional)
SELECT N.table_name, N.id, O.rowid FROM new_gpkg_contents N, gpkg_contents O WHERE N.table_name = O.table_name;

-- View mapping between new contents table and old contents table where the rowids do not match (optional)
SELECT N.table_name, N.id, O.rowid FROM new_gpkg_contents N, gpkg_contents O WHERE N.table_name = O.table_name AND N.rowid != O.rowid;

-- View new contents max id / rowid value (optional)
SELECT MAX(id) FROM new_gpkg_contents;

-- View new contents table next autoincrement id value (optional)
SELECT * FROM sqlite_sequence WHERE name = 'new_gpkg_contents';

-- Drop the old contents table
DROP TABLE gpkg_contents;

-- Change the name of the new contents table
ALTER TABLE new_gpkg_contents RENAME TO gpkg_contents;

-- Verify that the schema change did not break any foreign key constraints
PRAGMA foreign_key_check;

-- Commit the transaction
COMMIT;

-- Enable foreign key constraints
PRAGMA foreign_keys=on;
jyutzler commented 4 years ago

This comment applies here as well.

jyutzler commented 4 years ago

The shadow view approach is complicated by the presence of multiple foreign keys that point to gpkg_contents. It will take some time to produce a viable example but I hope to do so during TB-16 work.

bosborn commented 4 years ago

The shadow view w/ triggers approach as the new "standard" specification could result in a fair amount of headaches.

The above proposal, however, is to update the base specification gpkg_contents table with an id column, causing little to no backwards compatibility issues.

jyutzler commented 4 years ago

I've gotten strong feedback from other stakeholders that adding new columns is to be avoided or we would have done this all over the place for the last five years.

bosborn commented 4 years ago

Understood, however this is a very special case due to the rowid alias already existing as explained above.

jyutzler commented 4 years ago

We discussed this during the SWG meeting today and there was general agreement that adding a column to this table would be resisted by the GeoPackage community because of risks to interoperability. However, we do want to know more about what is driving this requirement. What is your use case for needing a robust numerical ID for GeoPackage contents?

bosborn commented 4 years ago

The Related Tables Extension requires integer ids for relationships. The NGA Contents Id Extension is a work around to this issue.

The only way to foreign key to a Contents through a non changing identifier is by name. GeoPackage's own Related Tables Extension requires integer id's as foreign keys and is not compatible with Contents level relationships.

String keys in database design are generally a bad idea in order to avoid primary keys that may need to change over the lifetime of the data. It is a little late for that now, but the suggested column is already a supported due to the SQLite rowid. So again, the risks are fairly low compared to adding a new column that is not already inherently supported.

jyutzler commented 4 years ago

What is the use case where a Related Tables Extension relationship needs to be made between GeoPackage contents and related content? The Metadata Extension was designed to support this case.

bosborn commented 4 years ago

Any extension that wants to relate data on a Contents wide level in addition to a Contents data row level nature. The Metadata Extension suffers the same issue as the foreign key to an individual Contents would still be by table name, and not the metadata id. The Metadata extension is an added unnecessary layer when an extension can better define its' data and simply needs to reference / foreign key to a Contents. The Related Tables Extension is an already approved and supported way of performing this relationship, but requires integer ids.

jyutzler commented 4 years ago

The Metadata Extension suffers the same issue

This is false. gpkg_metadata_reference has a reference_scope column which may be "table", in which case table_name is as expected and row_id_value is null. There is actually less overhead to the Metadata Extension than the Related Tables Extension.

Any extension that wants to relate data on a Contents wide level in addition to a Contents data row level nature.

When does this happen? We (SWG members) are not aware of a real-world scenario that calls for this design.

bosborn commented 4 years ago

The Metadata Extension suffers the same issue

As in the foreign key is still to a String table_name value and not an integer. This seems true and confirmed by your comment.

When does this happen? We (SWG members) are not aware of a real-world scenario that calls for this design.

In our Contents Id and Feature Style extensions. http://ngageoint.github.io/GeoPackage/docs/extensions/

Contents Id you and I discussed last year. Styles minimally over twitter which uses Contents Id.

July 24th, 2019

Jeff Yutzler  1:14 PM Regarding http://ngageoint.github.io/GeoPackage/docs/extensions/contents-id.html IIUC, you don't need this because you can always use the invisible rowid column if the table lacks an integer primary key

Brian Osborn  1:19 PM oh wow nice, i didn’t even know about that i can probably phase that one out then

October 18th, 2019

Brian Osborn  1:28 PM was looking into replacing our contents id extension.  it doesn’t look like the rowid can be relied on “If the rowid is not aliased by INTEGER PRIMARY KEY then it is not persistent and might change. In particular the VACUUM command will change rowids for tables that do not declare an INTEGER PRIMARY KEY. Therefore, applications should not normally access the rowid directly, but instead use an INTEGER PRIMARY KEY.” https://www.sqlite.org/rowidtable.html October 19th, 2019

Jeff Yutzler  6:36 PM grrr... October 21st, 2019

Jeff Yutzler  8:04 AM I've thought about this a bit and it strikes me as a bit of an edge case. These rowids will only change when rows are removed (how often does that happen to gpkg_contents?) and the database is vacuumed. This problem would be caught by an executable test and is quickly repaired. Do what you want, but I would recommend not worrying about it.

Brian Osborn  8:06 AM About to fly, but yeah I thought about maybe even adding extra vacuum behaviors to handle it. But I know other groups including sofwerx that raw vacuum, and then the database is invalid. We do vacuums as well. Sqlite databases get bloated over time and it's the only way to reclaim memory. And that documentation really states to not rely on it, as if there could be other things either now or in the future. In hindsight the contents would have an actual ID column. Which would have made renaming things and foreign relationships easier to maintain If that info was lost in a vacuum there would be no way to fix those relationships

Jeff Yutzler  9:38 AM I do regret that the initial implementation did not have an actual ID column. It must have seemed like a good idea at the time...

jyutzler commented 4 years ago

I don't get your point regarding the Metadata Extension. There is no inherent flaw with the Metadata extension's use of table_name as a key. These keys cannot break like rowids can.

Portrayal is a much more complicated case. Work in the Vector Tiles Pilot and other activities has demonstrated that a simple style extension like you proposed is insufficient for broad use. There are too many holes. However, the VTP2 ER does not consider the potential brittleness of rowids because the test case did not feature arbitrary adding and deleting of gpkg_contents rows combined with vacuuming. (It is hard to image this happening in the real world.) Nevertheless, this is something we can look into during Testbed-16.

Do you have any other potential cases where you require a robust integer primary key in gpkg_contents?

bosborn commented 4 years ago

We can move past the Metadata Extension as that went off topic. There is nothing wrong with it. You had mentioned it, and I was simply pointing out that it also requires a table name foreign key (similar to contents). Usage of that Metadata Extension does not always make the most sense for some extensions.

VTP2 appears to foreign key to the table names as well, so row ids should not be an issue there.

I can assure you that deleting contents in a GeoPackage combined with vacuuming is a very real world operation, more so than assuming it would not happen. I realize that it becomes hard for each individual or organization to think broadly outside of their own use cases. A specification working group should try to keep all possibilities and use cases in mind for a global specification.

Whether our Styles extension is deemed insufficient by OGC for broad use should have no sway in core specification design. Our Styles extension is authored by our organization and has proven highly successful in operational environments. The years without an official GeoPackage style spec forced that need.

I currently don't have additional potential use cases, but I imagine some may arise in the future. In hindsight, the GeoPackage specification should have included an integer id column, a table name column, and a display name column. Only the integer id column should have been used as a foreign key throughout the specification. A display name and/or table rename would have been possible through a single contents row column update. Good database design principles encourage primary keys that have no chance of changing. The GeoPackage spec instead has made table names / data layer names fairly permanent without modifying throughout the entire database.

Getting back on track... this proposal is to provide a mechanism to add a proper reliable contents integer identifier in a backwards and forwards compatible manner. The need can justifiably be argued on proper database and specification design principles. If the SWG's response is no, then this issue can be closed.

In Summary:

fjlopez commented 4 years ago

I agree with the opinion of @bosborn. Reliable ID integer primary keys that identify the rows as he recommends are useful to increase the number of potential use cases of the GeoPackage specification. For example, we can imagine a versioning extension that documents the evolution of the database schema (e.g. renaming a feature, adding attribute tables, evolving from version 1 to version 2). With an ID in Contents it is easy to develop it: you may track the changes in the row with such ID, or if this ID is a primary key with AUTOINCREMENT you can save the contents of the old row and restore it when required. Without the ID, the development of such extension is more difficult as you should develop it considering corner cases related to the recycling of table names.

jyutzler commented 4 years ago

We're in violent agreement that mistakes were made 6-8 years ago. The question is twofold:

  1. Can the community be convinced to accept this change given the risks to interoperability? Consider all of the systems out there that may be expecting the table to have a very specific schema. We would have to conduct extensive testing to ensure we aren't breaking compliant systems. People aren't exactly standing by waiting to do this. If this proves to be a breaking change, then we're looking at GeoPackage 2.0 which would bifurcate the user base. This is not appealing from a strategic perspective.
  2. Can we develop a mechanism to provide equivalent behavior without breaking everything? This is being evaluated as part of Testbed-16.

Either way, this proposal is not on the table for 1.3. You will have to stick with your join table for the time being. This is not the worst thing in the world.

bosborn commented 4 years ago

Sounds good. For number 2, take a look at our Contents Id extension in case OGC would like to adopt it as a solution. It does nothing more than associate an integer id to a contents table name. http://ngageoint.github.io/GeoPackage/docs/extensions/contents-id.html

fjlopez commented 4 years ago

@bosborn May just adding a last id INTEGER UNIQUE column to gpkg_contents provide an equivalent behaviour without risking interoperability?

https://sqlite.org/faq.html#q26 explains that uniqueness check in SQLite is performed only against non-null values. So, adding this may solve this issue without breaking existing systems. Legacy software can use it as code does not require changes because the new column is filled with the default value, i.e. null. Nevertheless, I wonder if some software can complaint of having an additional id column in the result of select * from gpkg_contents, but, IMHO, this should not break a system. New software can use this id column to provide values to be used for example in RTE. This provides same functionality as Contents Id but the design is similar to the WKT extension.

create table t1(name string primary key not null, id integer unique);
insert into t1(name) values ("feature_1");
insert into t1(name) values ("feature_2");
insert into t1(name, id) values ("feature_3", 1);
insert into t1(name, id) values ("feature_4", 2);
-- insert into t1(name) values ("feature_4"); 
-- Error: UNIQUE constraint failed: t1.name
-- insert into t1(name, id) values ("feature_5", 2); 
-- Error: UNIQUE constraint failed: t1.id
select * from t1;
name id
feature_1 null
feature_2 null
feature_3 1
feature_4 2
bosborn commented 4 years ago

Yes, that would be an acceptable solution as well.

I think the SWG's goal is to try and avoid all new columns (from comment above and #230), so I'm not sure it can be done.

I do understand that dealing with optional core spec columns is annoying after personally experiencing it with the WKT extension (#254 & #255). It is tough trying to improve the spec while maintaining compatibility. One potential solution could be migration scripts, which is similar to how I have maintained operational databases in the past. It could preserve backwards compatibility in the same way you listed while also allowing maintained implementations to upgrade GeoPackages.

jyutzler commented 4 years ago

@fjlopez it is not legal to use ALTER TABLE ADD COLUMN to add a row with a primary key or unique constraint so a migration script would be needed either way. https://sqlite.org/lang_altertable.html

fjlopez commented 4 years ago

A change in the schema of a database always require a migration script even if the script only contains only one DDL statement. Migration scripts such as the included in the proposal of @bosborn https://github.com/opengeospatial/geopackage/issues/516#issue-605879696 are usual in the database world and there are tooling to support them (e.g. Flyway).

However, I understand your concern. SQL migration scripts are a headache for many users, specially users that are GIS end users.

I have a worry which is a bit off-topic but I would like to know your opinion as I think that it is very related to this. I start to believe that a root cause of the stiff opposition to schema evolution, even when the issues are clear, is the lack of a command line tool in our ecosystem able to automatise versions upgrades and other schema management issues, and really easy to use by the GIS community (similar to GDAL command line programs). E.g., a tool something like:

$ geopackage info test.gpkg 
Database: test.gpkg
Current version: GeoPackage 1.1.0
Extensions: Geometry Type Triggers
$ geopackage upgrade 1.3.0 test.gpkg
Validate migration
Extension Geometry Type Triggers not supported in 1.3.0
Upgrade cannot be performed, use upgrade  --force to upgrade and remove
$ geopackage upgrade --force 1.3.0 test.gpkg
Validate migration
Successfully migration to GeoPackage 1.3.0
$ geopackage info test.gpkg 
Database: test.gpkg
Current version: GeoPackage 1.3.0

Is there such a command line tool? I.e., something like the command line version of https://github.com/ngageoint/geopackage-java but with steroids.

jyutzler commented 4 years ago

@fjlopez There is just the Executable Test Suite which can be run on the command line. However, the ETS doesn't do a fraction of what we want it to do. It doesn't even support warnings at this time.

I don't have any answer for this. The software you are talking about will not fall from the sky or grow on trees. OGC does not have the resources to fund this development. Software organizations (you mention GDAL but there are others) have their own tools but they aren't vetted to be used in an official capacity. I don't see this changing unless a deep-pocketed organization decides it is necessary.

Regarding migration script, what I meant is that it would require a script to rename the table, create the new table, and copy the data over, not just add a new column with a single line like you could in most RDBMSs.

jyutzler commented 3 years ago

Tagging this as Tile Matrix Set because the shadow view approach will be considered as part of the TMS support development as shadow views will be required to handle the other aspects of that extension.