Please check my math here:
A data provider provisions a table T1 which can only be accessed by users with credential C1. They also create a notebook N1 which can be examined as open source software, but which, if run without credential C1, cannot access anything in T1.
A user has a notebook U1 which can pass data via an Elyra pipeline to notebook N1. If N1 runs in a context where N1 has access to C1, N1 can accept data from U1, perform calculations that require data from T1, and return calculated results back to U1.
In the above, users can freely read the source code of N1, and they can activate N1 in a context that has C1, but they cannot directly access the running N1: it is a private process, protected as well as credential C1 can be protected, that produces its results and returns them to U1. It is N1's role to ensure that U1 cannot exfiltrate data from N1. It is the Data Commons platform's role to ensure that only N1, with the proper credentials, can access data in T1.
If the data provider provisioning T1 agrees, they can declare that all derived results returned from N1 to U1 can be published as open data (for whatever that's worth).
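For concreteness, a minimal sketch of how N1 might behave under this scheme, assuming a Trino SQLAlchemy connection is available; table, column, host, and credential names below are placeholders, not the actual ITR code:

```python
# Illustrative sketch only: all names here are placeholders.
import pandas as pd
from sqlalchemy import create_engine

def run_n1(company_ids: list[str]) -> pd.DataFrame:
    """Runs inside the C1-credentialed context; returns only derived results to U1."""
    # C1 is available only in N1's execution environment (e.g. injected by the
    # Elyra pipeline), so U1 never sees the credential itself.
    engine_c1 = create_engine("trino://n1_service_user@trino.example.org:443/osc_catalog")
    fundamentals = pd.read_sql("SELECT * FROM company_data.fundamental_data", engine_c1)
    subset = fundamentals[fundamentals["company_id"].isin(company_ids)]
    # Only an aggregate leaves N1; granular rows from T1 are never returned to U1.
    return subset.groupby("sector", as_index=False)[["co2_emissions"]].sum()
```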
Does that sound like what we are proposing to build?
Looking at the use-case:
Sounds to me like you agree with the general contours of the idea. It's now up to us to prove we can partition functionality so that developers have the full access they need, users get the access their tools need, and the open source community can transparently see both the lower-level and the higher-level source, without having access to the data (except where specifically granted as publicly viewable data).
@MichaelTiemannOSC as discussed, could we have a table listing the data elements and showing, by column, what belongs to the black box / grey box / crystal box? We will use this to POC the access rights management layer with Trino settings.
Please see this notebook: https://github.com/os-climate/ITR/blob/develop-project-vault/examples/vault_demo_n0.ipynb
Which uses these extensions of the ITR class hierarchy: https://github.com/os-climate/ITR/blob/develop-project-vault/ITR/data/vault_providers.py
If we want to support a unit-aware ITR tool that uses the Data Vault, we must resolve https://github.com/os-climate/os_c_data_commons/issues/51
@MichaelTiemannOSC Looking at the demo notebook at https://github.com/os-climate/ITR/blob/develop-project-vault/examples/vault_demo_n0.ipynb I have a couple of questions:
Can you confirm what is defined as fundamental company data, i.e. is it the entire content of the company_data.fundamental_data table except company and company_id?
In Step 5, you are trying to pass a list of company IDs to the Data Vault to get back a sum without exposing granular data. But the vault_warehouse you are using to get probability-adjusted temperature scores has been created on engine_dev, so in essence the process can access every bit of fundamental data with DEV access. Similarly, vault_company_data, used to get the portfolio-alignment temperature score based on corporate fundamental data such as emissions or other weights, is actually built on engine_dev. Do we have a problem with the use-case then?
Also, in Step 4, the temperature_scores table does not seem to contain fundamental data; what are we trying to show here in terms of data not being accessible?
In this use case, fundamental company data is defined as the financial factors over time. Names and IDs of companies are treated as identifiers and are not secret. Sectorization and regionalization could go either way (I think).
As for Step 5: we are using USER3 and engine_user to do the work, so it won't have engine_dev permissions when we actually implement the rules that Erik has recently enabled. engine_dev is responsible for creating quant- and user-accessible data. engine_quant is responsible for creating user-accessible data.
Also: acknowledged that within a single notebook, of course the user can reach in and touch any of the three engines, so it's best to read this in terms of how the engines separate the various concerns, not how the notebook itself does that. That will be next...
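For concreteness, a rough sketch of how the three engines might be constructed with separate credentials. Host, catalog, and env-var names are placeholders, authentication details are omitted, and the mapping of TRINO_USERn to engines is just how I read the thread; adjust to match the notebooks:

```python
# Sketch: one Trino engine per persona, so the separation of concerns is
# enforced by Trino's access rules rather than by any single notebook.
import os
from sqlalchemy import create_engine

def trino_engine(user_env_var: str):
    # The credential comes from the environment of whoever runs the notebook.
    user = os.environ[user_env_var]
    return create_engine(f"trino://{user}@trino.example.org:443/osc_catalog")

engine_dev   = trino_engine("TRINO_USER1")  # creates quant- and user-accessible data
engine_quant = trino_engine("TRINO_USER2")  # creates user-accessible data
engine_user  = trino_engine("TRINO_USER3")  # consumes published results only
```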
So I'm splitting up the notebook, and sure enough it doesn't break apart easily because this code:
vault_warehouse = DataVaultWarehouse(engine_dev, vault_company_data, vault_production_bm, vault_EI_bm)
is not something that can be executed as TRINO_USER2. The SQL database is persistent, which means that each user can get their own access to it. But now I have to figure out how to pickle/unpickle this object that is created by TRINO_USER1 and used by TRINO_USER3. Is pickling the right way to solve this problem?
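If pickling is the route taken, a minimal sketch might look like this. One caveat: live SQLAlchemy engines generally do not pickle, so the object would probably need to drop its engine before saving and be handed a fresh, lower-privileged engine after loading. Attribute and file names here are hypothetical:

```python
import pickle

# TRINO_USER1 side: drop the privileged engine before persisting, since live
# database engines/connections generally do not survive pickling.
vault_warehouse.engine = None          # assumes the engine is held as a plain attribute
with open("vault_warehouse.pkl", "wb") as f:
    pickle.dump(vault_warehouse, f)

# TRINO_USER3 side: load the object and re-attach a less-privileged engine.
with open("vault_warehouse.pkl", "rb") as f:
    warehouse = pickle.load(f)
warehouse.engine = engine_user         # hypothetical attribute name
```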
The way I envisioned this working is more by having each step / notebook generate intermediate tables and define the access rights necessary for the next level of data processing / notebook. That means, for example, the initial notebook producing all the necessary computations / weightings by Corporate ID for the next layer to process, without that layer having access to fundamental data. Can this work?
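A rough sketch of that pattern (schema, table, and column names are illustrative only):

```python
# Sketch of the intermediate-table pattern.
import pandas as pd
from sqlalchemy import text

# Dev notebook (engine_dev): materialize only the derived per-company weights
# the next layer needs; fundamental data never leaves this step.
with engine_dev.begin() as conn:
    conn.execute(text(
        "CREATE TABLE IF NOT EXISTS itr_intermediate.company_weights AS "
        "SELECT company_id, co2_emissions / revenue AS weight "
        "FROM company_data.fundamental_data"
    ))

# Downstream notebook (engine_user): its Trino access rules expose only
# itr_intermediate.company_weights, not the underlying fundamental data.
weights = pd.read_sql("SELECT * FROM itr_intermediate.company_weights", engine_user)
```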
Another thought: we should have conventions in our docs for the pipeline notebooks, which should include at least, in the header, the list of input data sources (schema / table names in Trino) and output data (schema / table names in Trino), with a description of the access required / granted. It would make the process a lot easier for someone contributing downstream in the pipeline (and also for access configuration).
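For example, the first cell of each pipeline notebook could carry something like this (all schema / table names below are placeholders):

```python
# --- Pipeline notebook header (proposed convention; all names are placeholders) ---
# Notebook:        itr_vault_quant.ipynb
# Runs as:         quant role (engine_quant)
# Inputs (Trino):
#   itr_intermediate.company_weights    -- produced by the dev notebook, quant-readable
# Outputs (Trino):
#   itr_results.temperature_scores      -- user-readable derived results
# Access required: SELECT on itr_intermediate.company_weights
# Access granted:  SELECT on itr_results.temperature_scores to the user role
```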
I really like the thought about the header. What would be really cool is if the textual header could be interpreted by a tool to produce a graphical diagram that shows the data and relationships and perhaps allows one to select a persona and see what direct or derived data is accessible (or not).
As for the separation issue: I just need to separate the actions of initialization (which requires engine_dev) from use (which can construct a usable instance from an already-initialized system).
In the latest check-in on the develop-project-vault branch, the three users now operate in their separate notebooks. We are ready to turn on enforcement at the Trino layer to validate the data access management mechanisms.
Dependency on #119 #80
Dependency on #148
The first use-case we want to implement will be based on the new ITR data pipeline. It will be developed once the access rights capabilities are available on cluster 2, as per the issues above.
Use case was demonstrated and access rights were resolved on Trino.
Apologies if this is a duplicate...
The ITR team have produced a schematic showing what data and calculations should be visible to which user roles. From that schematic, we need to determine how to construct a set of rules for a Data Vault containing LSEG and Urgentem data.
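One possible starting point, not a definitive implementation: sketch the schematic as Trino file-based access control rules, written here as a Python dict serialized to the rules JSON. Role, schema, and table names are placeholders for whatever the schematic specifies:

```python
# Sketch: map the black / grey / crystal box schematic onto Trino table rules.
# All user, schema, and table names below are placeholders.
import json

rules = {
    "tables": [
        # black box: raw provider data, readable only by the dev role
        {"user": "dev_.*", "schema": "company_data", "table": ".*",
         "privileges": ["SELECT", "INSERT", "OWNERSHIP"]},
        # grey box: intermediate derived tables, readable by the quant role
        {"user": "quant_.*", "schema": "itr_intermediate", "table": ".*",
         "privileges": ["SELECT"]},
        # crystal box: published results, readable by everyone
        {"user": ".*", "schema": "itr_results", "table": ".*",
         "privileges": ["SELECT"]},
        # default: no access
        {"user": ".*", "privileges": []},
    ]
}

with open("rules.json", "w") as f:
    json.dump(rules, f, indent=2)
```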
@joriscram @LeylaJavadova @caldeirav @HeatherAck @toki8