os-climate / data-platform-demo

Apache License 2.0

Trino Vault demo needs a provider's perspective #40

Open MichaelTiemannOSC opened 2 years ago

MichaelTiemannOSC commented 2 years ago

For discussion

If this is not a correct understanding of what we are trying to do, let's sort that out here.

It is a goal of the Vault to make it easy for providers to designate datasets for use by specific applications in specific ways: what tools have access to what tables, what calculations have access to what columns, and what user roles have access to what rows and what derived data based on data lineage.

For the provider-side demo, sample datasets will consist of rows of fictitious companies with various sample metrics. Queries of these metrics can return data lineage (i.e., for each data element, which table the element came from and which permission granted access to it). Calculations that produce derived data from those elements can also provide derived lineage. For example:

Providers can describe what columns are accessible to what calculations (which are somehow described in a consistent fashion between tool and provider). Column access so granted is part of the data lineage of access to data elements in that column.
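To make the column rule concrete, here is a minimal sketch, assuming a provider expresses each grant as a (table, column, calculation) triple. The `ColumnGrant` class, the `readable_columns` helper, and the table, column, and calculation names are all hypothetical and only illustrate the idea; they are not the Vault's or Trino's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnGrant:
    """Hypothetical provider rule: one named calculation may read one column."""
    table: str
    column: str
    calculation: str


def readable_columns(grants, table, calculation):
    """Map each readable column to the grant that made it readable.

    The grant itself is kept so it can later be recorded as lineage.
    """
    return {
        g.column: g
        for g in grants
        if g.table == table and g.calculation == calculation
    }


# Fictitious grants over a fictitious metrics table.
grants = [
    ColumnGrant("company_metrics", "scope1_emissions", "emissions_intensity"),
    ColumnGrant("company_metrics", "revenue", "emissions_intensity"),
    ColumnGrant("company_metrics", "headcount", "per_employee_footprint"),
]

allowed = readable_columns(grants, "company_metrics", "emissions_intensity")
print(sorted(allowed))
```

Returning the grant alongside the column name is a deliberate choice here: whatever answers "may this calculation read this column?" can then also answer "under which grant?".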

Providers can describe what rows are accessible to what user roles (which are somehow described in a consistent fashion between tool and user authentication/authorization system). Row access so granted is part of the data lineage of access to data elements in that row.
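The row rule could be sketched in a similar way, assuming each grant pairs a user role with a predicate over the row. `RowGrant`, `visible_rows`, the role names, and the sample companies are hypothetical; a real deployment would take roles from the authentication/authorization system rather than from string literals.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RowGrant:
    """Hypothetical provider rule: a role may see rows matching a predicate."""
    name: str
    role: str
    predicate: Callable[[dict], bool]


def visible_rows(rows, grants, role):
    """Return (row, grant_name) pairs for the rows this role may see.

    Only the first matching grant is recorded for each row.
    """
    result = []
    for row in rows:
        for g in grants:
            if g.role == role and g.predicate(row):
                result.append((row, g.name))
                break
    return result


# Fictitious companies, as in the proposed demo datasets.
rows = [
    {"company": "Fictitious Power Co", "sector": "utilities"},
    {"company": "Fictitious Steel Co", "sector": "steel"},
]

row_grants = [
    RowGrant("utilities-for-analysts", "analyst", lambda r: r["sector"] == "utilities"),
    RowGrant("all-rows-for-auditors", "auditor", lambda r: True),
]

for row, grant_name in visible_rows(rows, row_grants, "analyst"):
    print(row["company"], "via", grant_name)
```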

Whatever rule first grants access to a given data element (e.g., a tool marks a given calculation as public, a row-based rule permits access to all columns, or a column-based rule grants access to a column of an accessible row) is the sole basis of the lineage for that element. A calculation that successfully accesses multiple data elements must still do one of two things: union the lineage of its source elements, or declare its own lineage fact that it is authorized to issue for that calculation.
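One possible reading of the first-grant rule and of derived lineage is sketched below. It assumes rules are evaluated in a fixed order, that a lineage fact is a (table, rule name) pair, and that row accessibility is simplified to a role check. `Element`, `first_grant`, `read_element`, `derive`, and every rule name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    """A data value together with the lineage facts that justify access to it."""
    value: float
    lineage: frozenset  # set of (table, rule_name) pairs


def first_grant(rules, context) -> Optional[str]:
    """Return the name of the first rule (in order) that grants access."""
    for name, grants in rules:
        if grants(context):
            return name
    return None


def read_element(table, row, column, rules, context):
    """Read one element; the first granting rule is its sole lineage."""
    rule = first_grant(rules, {**context, "column": column})
    if rule is None:
        raise PermissionError(f"no rule grants access to {table}.{column}")
    return Element(row[column], frozenset({(table, rule)}))


def derive(fn, *elements, issued_fact=None):
    """Compute a derived element.

    Lineage is the union of the source lineages, unless the calculation
    issues its own lineage fact (assumed here to be authorized to do so).
    """
    value = fn(*(e.value for e in elements))
    if issued_fact is not None:
        return Element(value, frozenset({issued_fact}))
    return Element(value, frozenset().union(*(e.lineage for e in elements)))


# Ordered, hypothetical rules: public calculation, row-based, column-based.
rules = [
    ("public-calculation", lambda c: c["calculation_public"]),
    ("row-rule:analyst-all-columns", lambda c: c["role"] == "analyst"),
    ("column-rule:revenue", lambda c: c["column"] == "revenue"),
]

row = {"company": "Fictitious Co", "scope1_emissions": 1200.0, "revenue": 300.0}
analyst = {"role": "analyst", "calculation_public": False}
guest = {"role": "guest", "calculation_public": False}

emissions = read_element("company_metrics", row, "scope1_emissions", rules, analyst)
revenue = read_element("company_metrics", row, "revenue", rules, guest)

intensity = derive(lambda e, r: e / r, emissions, revenue)
print(intensity.value, sorted(intensity.lineage))
```

Here the analyst's read is attributed to the row rule and the guest's read to the column rule, so the derived value carries both facts. Passing `issued_fact` instead replaces the union with a single fact declared by the calculation.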

The purpose of this demonstration is two-fold:

  1. Demonstrate how providers can provision data and describe permissions granted to tools and user roles
  2. Demonstrate how data access management and data lineage provide both technical means to restrict and technical means to audit data access

@HeatherAck @LeylaJavadova