Catalog history tracking

karlcz commented 7 years ago

This issue was originally about logging of entity updates. It has since expanded scope to history capture for several related use cases. It is an umbrella task that will include quite a few related development activities...

Use Cases to Consider

Audit: a privileged operator can see who did what and when
UX for "undo": previous version(s) of an entity or entities should be retrievable so that a client can formulate a new revision that restores a previous value that was incorrectly modified
Named/persistent data sets: a query URL should be able to refer to a snapshot of entity content as it existed at a certain time. Clients should be able to retrieve the data again (within limits)
- Policy revisions/corrections may change the visibility of historical data
- In some environments, data redaction options may be required to remove accidental content from the catalog. This is not just a policy change but a purge of actual stored content.
- In some environments, a history horizon might be imposed to limit retention of historical data.

Challenges

The main challenge for history capture in ERMrest revolves around its generic/introspective nature. We cannot assume that the model (i.e. the SQL DDL) is constant throughout the history. Our clients and our protocol depend on an understanding of the model that governs the data being queried or exchanged. Thus, we will have to capture a history of model changes as well as data changes.

Closely related to the model is fine-grained authorization policy and model annotations. We also need to be able to capture these and serve them up with the historical model and data, applying the appropriate history-relevant policy when deciding access rights on historical data. However, it is possible that policies for a project change, including retroactive changes in access rights to past data. This could be due to legal/human issues or simply to correct a technical flaw in a previously deployed policy. Because policies are tightly coupled to the model, it is not sufficient to just "use the latest policy". Rather, there needs to be a way to amend the policy that will be applied to a historical model, as an orthogonal problem to amending the latest policy that will be applied to the latest model!

Similarly, data redaction or history pruning involves amending historical data. The goal in both cases is to purge data from storage so that it is no longer retrievable and no longer resident in storage resources. Thus, such amendments are destructive and would obviously not support further "undo" or history-of-history tracking. An out-of-band recovery scheme would be needed to reverse such destructive changes, just as it is required now to reverse all changes in the absence of history capture. New web APIs probably need to be defined to allow amendment of historical content.

Technical Features and Tasks

[x] Repackaging to consolidate all catalog management SQL between service and deploy task
[x] Split "new catalog" deployment SQL from "upgrade catalog" redeployment SQL
[x] OID-based materialized model state and introspection system
[x] OID-relabel mechanism to restore a dump (all OIDs change suddenly due to DBA action)
~~Selective OID-relabel mechanism to replace individual tables or columns~~
[x] Use transaction timestamp instead of transaction ID for version tracking/ETag/cache coherence
[x] Use serializable isolation level to avoid partially ordered update scenarios
[x] Make incremental changes to materialized model instead of stateless rediscovery on service-generated model changes
[x] Standard row ID and temporal columns, e.g. RID/RCT/RMT on internal model-related tables
- Use RID as main FK strategy in model-related tables too
[x] Standard row ID and temporal columns, e.g. RID/RCT/RMT, on all tables
[x] Triggers to maintain system columns even for out-of-band or indirect changes.
[x] Clean up constraint introspection and related conventions, e.g. "on delete" and "on update" rules and disambiguation
[x] History storage tables and history capturing trigger
[x] Internal backend support for introspecting historical model instead of latest live model
[x] Internal ermpath support for querying historical tuples instead of latest live data
[x] API extension for versioned catalog access for model and data
~~API extension for entity history access, i.e. polymorphic, longitudinal representations~~
~~API extension for audit access, i.e. event-stream representations~~
[x] Exclude historyless tables and views in historical model introspection
[x] API support for history amendment

The API extensions with ~~strikethrough~~ above are considered out of scope for an MVP release and should be added as later enhancements.
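To make the "history storage tables" and "querying historical tuples" items more concrete, here is a minimal in-memory sketch of the general idea: each stored version of a row carries the interval during which it was live, and a snapshot query selects the versions whose interval contains the requested time. This is not the ERMrest implementation, which per the checklist uses SQL history tables and triggers. The names `HistoryRow`, `HistoryTable`, `valid_from`, and `valid_until` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class HistoryRow:
    rid: str                          # stable row ID, in the spirit of RID
    valid_from: datetime              # timestamp of the transaction that wrote this version
    valid_until: Optional[datetime]   # None while this version is still the live one
    data: dict                        # column values of this version


class HistoryTable:
    """Hypothetical stand-in for a history table maintained by a trigger."""

    def __init__(self):
        self.rows = []

    def record(self, rid, when, data):
        """Close the live version of `rid`, if any, and open a new one.

        Passing data=None records a delete: the live version is closed
        and no new version is opened.
        """
        for row in self.rows:
            if row.rid == rid and row.valid_until is None:
                row.valid_until = when
        if data is not None:
            self.rows.append(HistoryRow(rid, when, None, dict(data)))

    def as_of(self, when):
        """Return {rid: data} for every version that was live at `when`."""
        return {
            r.rid: r.data
            for r in self.rows
            if r.valid_from <= when
            and (r.valid_until is None or when < r.valid_until)
        }


if __name__ == "__main__":
    t1 = datetime(2017, 9, 1, tzinfo=timezone.utc)
    t2 = datetime(2017, 9, 2, tzinfo=timezone.utc)
    table = HistoryTable()
    table.record("1-ABC", t1, {"name": "first value"})
    table.record("1-ABC", t2, {"name": "corrected value"})
    print(table.as_of(t1))   # the version written at t1
    print(table.as_of(t2))   # the version written at t2
```

The half-open interval (inclusive start, exclusive end) means a snapshot taken exactly at an update's timestamp sees the new version, never both. The same approach would apply to model and policy records, which is what lets a historical query be interpreted against the model that was in force at that time.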

karlcz commented 7 years ago

I think this needs more in-depth use case and roadmap discussion before anything can be planned.

There are serious decisions to make about the intersection of this feature with fine-grained security, if this log content is going to be exposed to anybody but server admins.

Also, if there is any expectation to ask questions about the history of specific records, you quickly stray into temporal DB territory with all the implications for scaling and the meaning of history that spans across model versions.

carlkesselman commented 7 years ago

Let's discuss. We have to have some mechanism to understand how people are using the editing features within a data model.

Carl

Dr. Carl Kesselman Dean’s Professor, Epstein Department of Industrial and Systems Engineering Fellow, Information Sciences Institute Viterbi School of Engineering

Professor, Preventive Medicine Keck School of Medicine

University of Southern California 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292-6695 Phone: +1 (310) 448-9338 Email: carl@isi.edu Web: http://www.isi.edu/~carl

In reply to Karl Czajkowski's comment of Friday, May 19, 2017 at 6:16 PM (subject: "Re: [informatics-isi-edu/ermrest] Improve logging for entity updates (#146)").


carlkesselman commented 7 years ago

Now that fine-grained access control is "done", can we revisit this issue and the best approach to move forward?

karlcz commented 7 years ago

Further discussion has raised the need for something more like a temporal/history access interface, which is quite different from audit/logging.

If a history mechanism would allow structured access to previous versions of records, etc., is there still a need to add detailed logging results from bulk operations? It seems to me that one could instead query the history system when that level of detail is needed, so we don't necessarily want or need to replicate it in two stores.

karlcz commented 7 years ago

NOTE: while this branch is in development, multiple internal refactoring/restructuring tasks will be done. No attempt will be made at backward compatibility between commits. Only an upgrade from the previous master all the way to the final branch state will be supported. Hence, any pilot user must be prepared to drop and reload entire catalogs after any incremental commit in the branch.

karlcz commented 7 years ago

This branch now has read-only access to previous versions of the catalog for both schema and data retrieval. In general, any existing retrieval API like:

GET /ermrest/catalog/N/...

now has a complementary history API like:

GET /ermrest/catalog/N@revision/...

where revision is a URL-encoded timestamp identifying a transaction that created the referenced catalog state. For convenience, imprecise revision timestamps will be matched to the most recent revision which occurred at or before the requested timestamp.
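As a rough illustration of the two ideas above, the sketch below builds a versioned URL from a timestamp and shows how an imprecise timestamp could be matched to the latest revision at or before it. The ISO 8601 timestamp format, the function names `history_url` and `resolve_revision`, and the example `schema` path are assumptions made for this sketch; the comment above only says that the revision is a URL-encoded timestamp.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from typing import List, Optional
from urllib.parse import quote


def history_url(catalog_id: str, revision: datetime, path: str) -> str:
    """Build a versioned catalog URL of the form /ermrest/catalog/N@revision/...

    Assumption: the revision is written as an ISO 8601 timestamp and then
    percent-encoded so that ':' and '+' are safe inside the URL path.
    """
    stamp = quote(revision.isoformat(), safe="")
    return f"/ermrest/catalog/{catalog_id}@{stamp}/{path}"


def resolve_revision(known: List[datetime], requested: datetime) -> Optional[datetime]:
    """Return the most recent known revision at or before `requested`.

    `known` must be sorted ascending. Returns None if `requested` predates
    every known revision.
    """
    i = bisect_right(known, requested)
    return known[i - 1] if i else None


if __name__ == "__main__":
    revisions = [
        datetime(2017, 9, 1, 12, 0, tzinfo=timezone.utc),
        datetime(2017, 9, 1, 22, 16, tzinfo=timezone.utc),
    ]
    # An imprecise request falls back to the revision at or before it.
    wanted = datetime(2017, 9, 1, 18, 0, tzinfo=timezone.utc)
    matched = resolve_revision(revisions, wanted)
    print(history_url("1", matched, "schema"))
```

Resolving to the revision at or before the requested time means a client can use any wall-clock time it recorded and still get a well-defined catalog state. A request earlier than the first revision has no state to return, which the sketch reports as `None`.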

carlkesselman commented 7 years ago

Perhaps at this point we can try working out a use case, such as assigning an RDA-compliant identifier to a data collection.

Carl


In reply to Karl Czajkowski's comment of Sep 1, 2017, 3:16 PM.


karlcz commented 7 years ago

For the history amendment feature, which I think is a blocking feature for a production MVP, we need to think about the authorization model.

I think it will need a new access mode like amend which can appear in rights summaries and is distinct from normal update, delete, and insert rights which apply to the latest evolving catalog.

I also think we can limit ourselves to these amendment features:

redacting data by setting fields to NULL
- MAY violate ERM reference constraints or not-null constraints within the historical data
- should we automatically redact any referencing foreign keys if a key is redacted?
mutating data access ACLs and ACL bindings
mutating annotations
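The first amendment feature, redaction by setting fields to NULL, could look roughly like the sketch below. It assumes historical versions are available as a list of dictionaries keyed by column name with an `RID` entry, which is a simplification made for illustration. The function name `redact` is hypothetical. Cascading to referencing foreign keys is deliberately left out, since that question is still open above.

```python
from typing import Iterable, List


def redact(versions: List[dict], rid: str, columns: Iterable[str]) -> int:
    """Set the given columns to None in every stored version of row `rid`.

    Returns the number of versions that actually changed. The change is
    destructive: the old values are overwritten in place and cannot be
    recovered from `versions` afterwards.
    """
    cols = list(columns)
    changed = 0
    for version in versions:
        if version.get("RID") != rid:
            continue
        # Only count versions that still hold a value in a targeted column.
        if any(version.get(c) is not None for c in cols):
            for c in cols:
                if c in version:
                    version[c] = None
            changed += 1
    return changed


if __name__ == "__main__":
    history = [
        {"RID": "1-A", "email": "old@example.org", "name": "x"},
        {"RID": "1-A", "email": "new@example.org", "name": "x"},
        {"RID": "1-B", "email": "other@example.org", "name": "y"},
    ]
    print(redact(history, "1-A", ["email"]), "versions redacted")
    print(history)
```

Because the redaction touches every stored version of the row, the value disappears from all snapshots at once. As noted above, the result may violate not-null or reference constraints that held when the data was live, so any validation of historical data would have to tolerate NULLs introduced this way.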

I think we should disallow model changes in history amendment. If someone has a need to redact sensitive info that was encoded into the model structure itself, I think they have no recourse but to ETL sanitized content into a new catalog and destroy the old one.

For an initial MVP, I'd like to grant this amend right only to owners. But, do we need fine-grained amendment rights, i.e. you can tamper with history in one schema or table but not others? Or should we just grant it to the overall catalog owners who have holistic responsibilities for the content?

@carlkesselman @robes

karlcz commented 7 years ago

The history amendment APIs are prototyped now, but the test suite isn't covering them yet.

karlcz commented 7 years ago

There are now basic test cases for the history APIs. I think this feature is ready for more testing and review, but the corresponding PR will be broadened with some other closely related changes before it is ready to merge...
