os-climate / os_c_data_commons

Repository for Data Commons platform architecture overview, as well as developer and user documentation
Apache License 2.0

Data Vault ITR use case #90

Closed MichaelTiemannOSC closed 2 years ago

MichaelTiemannOSC commented 3 years ago

Apologies if this is a duplicate...

The ITR team have produced a schematic showing what data and calculations should be visible to which user roles. From that schematic, we need to determine how it can be constructed as a set of rules for a Data Vault containing LSEG and Urgentem data. ITR-Prototype-Github

@joriscram @LeylaJavadova @caldeirav @HeatherAck @toki8

MichaelTiemannOSC commented 3 years ago

Please check my math here:

A data provider provisions a table T1 which can only be accessed by users with credential C1. They also create a notebook N1 which can be examined as open source software, but if run without credential C1, it cannot get access to anything in T1, which requires the C1 credential.

A user has a notebook U1 which can pass data via an Elyra pipeline to notebook N1. If N1 runs in a context where N1 has access to C1, N1 can accept data from U1, perform calculations that require data from T1, and return calculated results back to U1.

In the above, users can freely read the source code of N1, and they can activate N1 in a context that has C1, but they cannot directly access the running N1: it is a private process, protected as well as credential C1 can be protected, running to produce whatever results it produces and returning those results to U1. It is N1's role to ensure that U1 cannot exfiltrate data from N1. It is the Data Commons platform's role to ensure that only an N1 with proper credentials can access data in T1.
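The trust boundary described above can be sketched in plain Python. This is only an illustration of the contract, not the actual ITR code: T1 is modeled as an in-memory dict standing in for a credential-protected Trino table, and `n1_compute` plays the role of N1, returning derived results to U1 while never exposing raw T1 rows. All names and numbers here are made up.

```python
# T1: provider data, visible only inside N1's credentialed context
# (illustrative values; in the real system this is a Trino table behind C1)
T1 = {
    "CompanyA": {"emissions": 120.0, "revenue": 40.0},
    "CompanyB": {"emissions": 80.0, "revenue": 50.0},
}

def n1_compute(company_ids):
    """N1's public interface: accepts identifiers from U1, performs a
    calculation that requires T1, and returns only derived intensities.
    Raw emissions/revenue figures never cross the boundary back to U1."""
    results = {}
    for cid in company_ids:
        row = T1.get(cid)
        if row is None:
            continue  # unknown id: nothing to return, nothing leaked
        results[cid] = round(row["emissions"] / row["revenue"], 3)
    return results

# U1's side of the boundary: it sees only the derived numbers
derived = n1_compute(["CompanyA", "CompanyB"])
```

The point of the sketch is the shape of the interface: U1 supplies identifiers, N1 supplies derived values, and exfiltration prevention is N1's responsibility.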

If the data provider provisioning T1 agrees, they can declare that all derived results returned from N1 to U1 can be published as open data (for whatever that's worth).

Does that sound like what we are proposing to build?

caldeirav commented 3 years ago

Looking at the use-case:

MichaelTiemannOSC commented 3 years ago

Sounds to me like you agree with the general contours of the idea. It's now up to us to prove we can partition functionality so that developers have the full access they need, users get the access their tools need, and the open source community can transparently see both the lower-level and the higher-level source, without having access to the data (except where specifically granted as publicly viewable data).

caldeirav commented 3 years ago

@MichaelTiemannOSC as discussed, could we have a table listing the data elements and showing, column by column, what belongs to the black box / grey box / crystal box? We will use this to build a POC of the access-rights management layer with Trino settings.
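One possible shape for that column-to-box table, sketched as Python data so it could later drive Trino rule generation. The column names, box assignments, and role names (`engine_dev` / `engine_quant` / `engine_user`, which appear later in this thread) are illustrative assumptions, not an agreed classification.

```python
# Hypothetical column classification for the access-rights POC.
# black = provider-only fundamental data, grey = derived quant data,
# crystal = open to all users.
COLUMN_BOXES = {
    "company_name": "crystal",
    "company_id": "crystal",
    "revenue": "black",
    "emissions": "black",
    "temperature_score": "grey",
}

# Which boxes each engine role may see (assumed mapping)
ROLE_CAN_SEE = {
    "engine_dev": {"black", "grey", "crystal"},
    "engine_quant": {"grey", "crystal"},
    "engine_user": {"crystal"},
}

def visible_columns(role):
    """Columns a given role may select -- the raw material for
    generating Trino column-level access rules."""
    allowed = ROLE_CAN_SEE[role]
    return sorted(c for c, box in COLUMN_BOXES.items() if box in allowed)
```

Keeping the classification as data (rather than hand-written Trino rules) would let the same table drive both documentation and rule generation.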

MichaelTiemannOSC commented 2 years ago

Please see this notebook: https://github.com/os-climate/ITR/blob/develop-project-vault/examples/vault_demo_n0.ipynb

Which uses these extensions of the ITR class hierarchy: https://github.com/os-climate/ITR/blob/develop-project-vault/ITR/data/vault_providers.py

MichaelTiemannOSC commented 2 years ago

If we want to support a unit-aware ITR tool that uses the Data Vault, we must resolve https://github.com/os-climate/os_c_data_commons/issues/51

caldeirav commented 2 years ago

@MichaelTiemannOSC Looking at the demo notebook at https://github.com/os-climate/ITR/blob/develop-project-vault/examples/vault_demo_n0.ipynb I have a couple of questions:

caldeirav commented 2 years ago

Also, in Step 4 the temperature_scores table does not seem to contain fundamental data; what are we trying to show here in terms of data not being accessible?

MichaelTiemannOSC commented 2 years ago

In this use case, fundamental company data is defined as the financial factors over time. Names and IDs of companies are treated as identifiers, not secrets. Sectorization and regionalization could go either way (I think).

As for Step 5: we are using USER3 and engine_user to do the work, so it won't have engine_dev permissions when we actually implement the rules that Erik has recently enabled. engine_dev is responsible for creating quant- and user-accessible data. engine_quant is responsible for creating user-accessible data.

Also: acknowledged that within a single notebook the user can, of course, reach in and touch any of the three engines, so it's best to read this in terms of how the engines separate the various concerns, not how the notebook itself does. That will be next...

MichaelTiemannOSC commented 2 years ago

So I'm splitting up the notebook, and sure enough it doesn't break apart easily, because this code:

vault_warehouse = DataVaultWarehouse(engine_dev, vault_company_data, vault_production_bm, vault_EI_bm)

is not something that can be executed as TRINO_USER2. The SQL database is persistent, which means each user can get their own access to it. But now I have to figure out how to pickle/unpickle this object, which is created by TRINO_USER1 and used by TRINO_USER3. Is pickling the right way to solve this problem?
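One practical caveat with the pickling idea: objects that hold live connections (sockets, database engines) are generally not picklable at all, so a warehouse object wrapping a Trino engine would fail to serialize regardless of the permissions question. A minimal sketch, using a raw socket to stand in for an engine connection (the `Warehouse` class is hypothetical, not the ITR one):

```python
import pickle
import socket

class Warehouse:
    """Stand-in for an object that wraps a live DB connection."""
    def __init__(self, conn):
        self.conn = conn  # live connection, bound to one user's credentials

# Attempting to pickle an instance fails because the socket it holds
# cannot be serialized.
w = Warehouse(socket.socket())
try:
    pickle.dumps(w)
except TypeError as exc:
    print("cannot pickle:", exc)
```

This is one argument for the alternative discussed below in the thread: persist state in tables (or reconstruct the object per-user) rather than shipping a pickled object between users with different credentials.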

caldeirav commented 2 years ago

The way I envisioned this working is more by having each step / notebook generate intermediate tables and define the access rights necessary for the next level of data processing / notebook. That would mean, for example, an initial notebook producing all the necessary computation / weightage by Corporate Id for the next layer to process, without that layer having access to fundamental data. Can this work?
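The intermediate-table idea can be sketched as a CTAS (CREATE TABLE AS) statement that materializes only derived columns, so the next stage can be granted access to the new table without ever seeing the fundamental-data source. Schema, table, and column names below are illustrative assumptions:

```python
def weightage_ctas(source_table, target_table):
    """Build a CREATE TABLE AS statement that exposes only an
    identifier and a derived weightage -- the raw fundamental
    columns (emissions, revenue) stay behind in the source table."""
    return (
        f"CREATE TABLE {target_table} AS\n"
        f"SELECT company_id,\n"
        f"       emissions / revenue AS weightage\n"
        f"FROM {source_table}"
    )

# e.g. a dev-credentialed notebook would run this once, then grant the
# downstream role SELECT on the target table only
sql = weightage_ctas("black.fundamentals", "grey.weightages")
```

Access rights then attach to tables, which Trino can enforce directly, rather than to in-flight Python objects.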

Another thought: we should have conventions in our docs for the pipeline notebooks. At a minimum, the header should list the input data sources (schema / table names in Trino) and the output data (schema / table names in Trino), with a description of the access required / granted. That would make the process a lot easier for someone contributing downstream in the pipeline (and also for access configuration).
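A possible shape for such a header convention, with a tiny parser to show it could be machine-readable (and so eventually feed a diagram or access-configuration tool). The header format and table names are invented for illustration:

```python
# Hypothetical pipeline-notebook header: inputs and outputs as
# Trino schema.table names, annotated with the access involved.
HEADER = """\
Inputs:  demo_dv.company_data (requires engine_dev)
Outputs: demo_dv.temperature_scores (granted to engine_user)
"""

def parse_header(text):
    """Extract input/output declarations from a notebook header."""
    io = {"Inputs": [], "Outputs": []}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        key = key.strip()
        if key in io and rest.strip():
            io[key].append(rest.strip())
    return io
```

A structured header like this is what would let a tool draw the data-flow and per-persona visibility diagram mentioned in the next comment.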

MichaelTiemannOSC commented 2 years ago

I really like the thought about the header. What would be really cool is if the textual header could be interpreted by a tool to produce a graphical diagram that shows the data and relationships and perhaps allows one to select a persona and see what direct or derived data is accessible (or not).

As for the separation issue: I just need to separate the action of initialization (which requires engine_dev) from use (which can construct a usable instance from an already-initialized system).
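That initialization/use split might look something like this: a plain constructor for the use side (assumes the persistent tables already exist), and a separate classmethod for the one-time dev-side setup. Class, method, and table names are sketches, not the actual `DataVaultWarehouse` API:

```python
class VaultWarehouse:
    """Sketch of separating one-time setup from everyday use."""

    def __init__(self, engine, schema="demo_dv"):
        # Use side: any engine (e.g. engine_user) whose credentials
        # have been granted read access to the existing tables.
        self.engine = engine
        self.schema = schema

    @classmethod
    def initialize(cls, dev_engine, schema="demo_dv"):
        # Dev side: requires engine_dev permissions; creates the
        # persistent tables once, then hands back a usable instance.
        dev_engine.execute(
            f"CREATE TABLE IF NOT EXISTS {schema}.temperature_scores (...)"
        )
        return cls(dev_engine, schema)
```

With this split, TRINO_USER3 never needs the pickled object: it just constructs `VaultWarehouse(engine_user)` against tables that TRINO_USER1 already created.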

MichaelTiemannOSC commented 2 years ago

In the latest check-in on the develop-project-vault branch, the three users now operate in their separate notebooks. We are ready to turn on enforcement at the Trino layer to validate the data-access management mechanisms.

eoriorda commented 2 years ago

Dependency on #119 #80

eoriorda commented 2 years ago

Dependency on #148

caldeirav commented 2 years ago

The first use-case we want to implement will be based on the new ITR data pipeline. To be developed once the access rights capabilities are available on cluster 2, as per above issue.

eoriorda commented 2 years ago

Use case was demonstrated and access rights were resolved on Trino.
