informatics-isi-edu / ermrest

ERMrest (rhymes with "earn rest") is a general relational data storage service for web-based, data-oriented collaboration.
Apache License 2.0

Catalog history tracking #146

Closed karlcz closed 7 years ago

karlcz commented 7 years ago

This issue was originally about logging of entity updates. It has since expanded scope to history capture for several related use cases. It is an umbrella task that will include quite a few related development activities...

Use Cases to Consider

Challenges

The main challenge for history capture in ERMrest revolves around its generic/introspective nature. We cannot assume that the model (i.e. the SQL DDL) is constant throughout the history. Our clients and our protocol depend on an understanding of the model that governs the data being queried or exchanged. Thus, we will have to capture a history of model changes as well as data changes.

Closely related to the model are fine-grained authorization policies and model annotations. We also need to capture these and serve them up with the historical model and data, applying the appropriate history-relevant policy when deciding access rights on historical data. However, the policies for a project may change, including retroactive changes in access rights to past data. This could be due to legal/human issues or simply to correct a technical flaw in a previously deployed policy. Because policies are tightly coupled to the model, it is not sufficient to just "use the latest policy". Rather, there needs to be a way to amend the policy that will be applied to a historical model, as a problem orthogonal to amending the latest policy that will be applied to the latest model!
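To make the distinction concrete, here is a minimal sketch of a per-revision policy store. It is not ERMrest's implementation: the class `PolicyHistory`, its methods, the integer revision keys, and the policy dictionaries are all hypothetical, chosen only to show that amending the policy attached to a historical model is a separate operation from recording a new latest policy.

```python
from bisect import bisect_right, insort


class PolicyHistory:
    """Hypothetical store mapping each model revision to the policy governing it."""

    def __init__(self):
        self._revisions = []  # sorted revision keys (integers here, by assumption)
        self._policies = {}   # revision -> policy dict

    def record(self, revision, policy):
        """Record the policy that takes effect at a new revision."""
        if revision not in self._policies:
            insort(self._revisions, revision)
        self._policies[revision] = dict(policy)

    def policy_at(self, requested):
        """Return the policy of the latest revision at or before `requested`."""
        i = bisect_right(self._revisions, requested)
        if i == 0:
            raise KeyError("no revision at or before the requested point")
        return self._policies[self._revisions[i - 1]]

    def amend(self, revision, policy):
        """Replace the policy of one historical revision; other revisions are untouched."""
        if revision not in self._policies:
            raise KeyError(revision)
        self._policies[revision] = dict(policy)


history = PolicyHistory()
history.record(10, {"select": ["*"]})
history.record(20, {"select": ["members"]})

# Retroactively tighten access to the older state without touching the latest policy.
history.amend(10, {"select": ["curators"]})
```

In this sketch, a request against any point between revisions 10 and 20 is judged by the amended historical policy, while requests at or after revision 20 still see the latest one.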

Similarly, data redaction or history pruning involves amending historical data. The goal in both cases is to purge data from storage so that it is no longer retrievable and no longer resident in storage resources. Thus, such amendments are destructive and would obviously not support further "undo" or history-of-history tracking. An out-of-band recovery scheme would be needed to reverse such destructive changes, just as it is required now to reverse all changes in the absence of history capture. New web APIs probably need to be defined to allow amendment of historical content.

Technical Features and Tasks

The API extensions with strikethrough above are considered out of scope for an MVP release and should be added as later enhancements.

karlcz commented 7 years ago

I think this needs more in-depth use case and roadmap discussion before anything can be planned.

There are serious decisions to make about the intersection of this feature with fine-grained security, if this log content is going to be exposed to anybody but server admins.

Also, if there is any expectation to ask questions about the history of specific records, you quickly stray into temporal DB territory with all the implications for scaling and the meaning of history that spans across model versions.

carlkesselman commented 7 years ago

Let's discuss. We have to have some mechanism to understand how people are using the editing features within a data model.

Carl



In reply to Karl Czajkowski's message of Friday, May 19, 2017 at 6:16 PM (Subject: Re: [informatics-isi-edu/ermrest] Improve logging for entity updates (#146)).


carlkesselman commented 7 years ago

Now that fine-grained access control is "done," can we revisit this issue and the best approach to moving forward?

karlcz commented 7 years ago

Further discussion has raised the need for something more like a temporal/history access interface, which is quite different from audit/logging.

If a history mechanism would allow structured access to previous versions of records, etc., is there still a need to add detailed logging results from bulk operations? It seems to me that one could instead query the history system if one needs that level of detail, so we don't necessarily want or need to replicate that level of detail into two stores.

karlcz commented 7 years ago

NOTE: while this branch is in development, multiple internal refactoring/restructuring tasks will be done. No backward compatibility between commits will be attempted. Only an upgrade from the previous master all the way to the final branch state will be supported. Hence, any pilot user must be prepared to drop and reload entire catalogs after any incremental commit in the branch.

karlcz commented 7 years ago

This branch now has read-only access to previous versions of the catalog for both schema and data retrieval. In general, any existing retrieval API like:

GET /ermrest/catalog/N/...

now has a complementary history API like:

GET /ermrest/catalog/N@revision/...

where revision is a URL-encoded timestamp identifying a transaction that created the referenced catalog state. For convenience, imprecise revision timestamps will be matched to the most recent revision which occurred at or before the requested timestamp.
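As a rough illustration of the two ideas above (URL-encoding the revision and snapping an imprecise timestamp to the latest revision at or before it), here is a small sketch. The helper names `history_url` and `resolve_revision`, the ISO 8601 timestamp text, and the in-memory list of revisions are assumptions for the example only; the actual timestamp syntax and the matching itself are up to the server.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from urllib.parse import quote


def history_url(catalog_id, revision, path=""):
    """Build /ermrest/catalog/N@revision/... with the timestamp URL-encoded.

    ISO 8601 is an assumed timestamp format, not one confirmed by the thread.
    """
    encoded = quote(revision.isoformat(), safe="")
    return f"/ermrest/catalog/{catalog_id}@{encoded}/{path.lstrip('/')}"


def resolve_revision(revisions, requested):
    """Return the most recent revision at or before `requested`, or None.

    `revisions` must be sorted ascending; this mimics, client-side, the
    matching rule the server is described as applying.
    """
    i = bisect_right(revisions, requested)
    return revisions[i - 1] if i else None


known = [
    datetime(2017, 8, 1, tzinfo=timezone.utc),
    datetime(2017, 9, 1, 15, 16, tzinfo=timezone.utc),
]
snapped = resolve_revision(known, datetime(2017, 8, 15, tzinfo=timezone.utc))
url = history_url(1, snapped, "schema")
```

Here a request for mid-August snaps to the August 1 revision, since that is the latest catalog state created at or before the requested time.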

carlkesselman commented 7 years ago

Perhaps at this point we can try working out a use case: assigning an RDA-compliant identifier to a data collection.

Carl



In reply to Karl Czajkowski's message of Sep 1, 2017, at 3:16 PM.


karlcz commented 7 years ago

For the history amendment feature, which I think is a blocking feature for a production MVP, we need to think about the authorization model.

I think it will need a new access mode like amend which can appear in rights summaries and is distinct from normal update, delete, and insert rights which apply to the latest evolving catalog.

I also think we can limit ourselves to these amendment features:

I think we should disallow model changes in history amendment. If someone has a need to redact sensitive info that was encoded into the model structure itself, I think they have no recourse but to ETL sanitized content into a new catalog and destroy the old one.

For an initial MVP, I'd like to grant this amend right only to owners. But, do we need fine-grained amendment rights, i.e. you can tamper with history in one schema or table but not others? Or should we just grant it to the overall catalog owners who have holistic responsibilities for the content?
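The owner-only option could look roughly like the sketch below. Everything here is hypothetical (the function `rights_summary`, the role lists, and the shape of the summary are not ERMrest's actual ACL structures); it also assumes, for simplicity, that owners hold all of the normal rights on the latest catalog.

```python
LIVE_MODES = ("insert", "update", "delete")


def rights_summary(client_roles, owner_roles, granted_live_modes):
    """Summarize a client's rights, keeping `amend` separate from live rights.

    Live rights apply to the latest, evolving catalog. In this sketch the
    `amend` right on history is granted to catalog owners only.
    """
    is_owner = not set(client_roles).isdisjoint(owner_roles)
    summary = {
        mode: is_owner or mode in granted_live_modes for mode in LIVE_MODES
    }
    summary["amend"] = is_owner
    return summary


editor = rights_summary(["curators"], ["admins"], {"insert", "update"})
owner = rights_summary(["admins"], ["admins"], set())
```

With these inputs, the editor can insert and update live data but cannot amend history, while the owner can. A fine-grained variant would compute `amend` per schema or table instead of once per catalog.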

@carlkesselman @robes

karlcz commented 7 years ago

The history amendment APIs are prototyped now, but the test suite isn't covering them yet.

karlcz commented 7 years ago

There are now basic test cases for the history APIs. I think this feature is ready for more testing and review, but the corresponding PR will be broadened with some other closely related changes before it is ready to merge...