Closed christoferlof closed 2 years ago
Let's consider the workflow identified here: https://docs.microsoft.com/en-us/azure/architecture/example-scenario/ai/secure-compute-for-research as an initial data export workflow.
The dataflow described is a pipeline/data supply chain with governance (i.e. contracts, licensing, etc) controlling ingress and egress.
Staging inputs and outputs can be considered an activity in a pipeline with different roles, responsibilities and tools to those used to produce the artefacts themselves.
Now, the primary resource within a TRE that provides governance boundaries is the "workspace".
One idea would be to consider ways to connect/orchestrate workspaces each with distinct ingress/egress policies governed and aligned with contracts/SLAs/agreements etc as appropriate. For example,
Ingress Staging Workspace (linking, pseudonymisation, de-identification, etc) -> ML Workspace (artefact production) -> Egress Staging Workspace (as above)
By using the workspace construct to flexibly provision environment and data controls (in the context of the principles of the 5 safes) we can create assurances that the workspace meets the legal requirements, etc.
A pipeline of workspaces would also start work towards federation of TREs.
Interesting way of thinking about it.
So everything is a data import "job" specific to a destination workspace.
Can all users request to send data to any workspace (maybe even in another TRE), or only to workspaces they are a member of?
There still needs to be an approval flow. Do we have a Data Reviewer role (say Workspace Owner for now) who can see all artefacts that are pending approval to get into their workspace, and simply move them from a transfer location into their workspace shared storage should they be "approved"?
How does the approver then get it to the Researcher's outside location - say a client machine? Is this a special case? Do they need to be able to download their approved exports?
So everything is a data import "job" specific to a destination workspace.
Yes, everything is a "process" or "job" as you say. This is useful for risk assessment and mitigation. It clearly identifies security zones with the ability to implement "controls" between processes. This could consider aspects of data de-identification and environment configuration for downstream processes.
Can all users request to send data to any workspace (maybe even in another TRE), or only to workspaces they are a member of?
That's a policy decision depending on the data rights, but it should be the "Workspace Owner's" responsibility (or their delegate's). I think it's important to recognise that the flow of data is also a flow of ownership and rights that constrain downstream processing.
In terms of defining workspace connectivity, I'd suggest that a workspace owner would specify a whitelist of downstream workspaces that are allowed to request artefacts.
Workspace owners should not need to be members of downstream workspaces. I think it should be the other way around with downstream "Workspace Owners" or external processes requesting upstream artefacts (pull).
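The pull model described above can be sketched as a simple allow-list check. This is a minimal illustration only; the workspace names and the `can_request_artefact` helper are made up for this sketch and are not part of any real TRE codebase:

```python
# Hypothetical sketch of the pull model: a downstream workspace may
# request an artefact only if the upstream workspace owner has put it
# on the upstream allow-list. All names are illustrative.
from dataclasses import dataclass, field


@dataclass
class Workspace:
    workspace_id: str
    # Downstream workspaces allowed to request artefacts from this one
    allowed_downstream: set[str] = field(default_factory=set)


def can_request_artefact(upstream: Workspace, downstream_id: str) -> bool:
    """A downstream workspace pulls; the upstream owner's policy decides."""
    return downstream_id in upstream.allowed_downstream


ingress = Workspace("ingress-staging", allowed_downstream={"ml-workspace"})
print(can_request_artefact(ingress, "ml-workspace"))    # True
print(can_request_artefact(ingress, "egress-staging"))  # False
```

The point of the sketch is that policy lives with the upstream owner: downstream workspaces request, but never push.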
There still needs to be an approval flow. Do we have a Data Reviewer role (say Workspace Owner for now) who can see all artefacts that are pending approval to get into their workspace, and simply move them from a transfer location into their workspace shared storage should they be "approved"?
Yes, the artefact requests will need to be approved. It could be the workspace owner for workspace processes, or delegated to some "Data Reviewer".
If a workspace request is to an upstream service outside of the current TRE (e.g. NHS provider) it would follow the same sort of protocol. That's how it happens today.
I am concerned with the assumption of data "moving/copied" through workspaces. We need to be careful about that. The assumption that data is copied through workspace processes may not be feasible for larger datasets. However, with a supply chain model it will be possible to deploy workspaces next to the data without moving the data to the workspace. There's also the whole world of data virtualisation options in relation to TREs and workspaces. What we should be doing is managing the flow of rights to access data, independently of the data tech, if possible.
How does the approver then get it to the Researcher's outside location - say a client machine? Is this a special case? Do they need to be able to download their approved exports?
I'm not sure what you mean. Some workspaces may, under certain conditions, allow artefacts to be exported from the TRE.
We should also remember that the processing of data is undertaken in the context of a risk management process. Take a look at the diagram below that outlines the trust, dependency and rights relationships between processes. In the example, we have
What's important is the explicit injection of risk management into processes to allow for points of governance and stewardship, along with an overarching model of risk
This is one flow and there are primitives (resource sharing, resource encapsulation, etc.) that we can consider and use to build trusted data supply chains.
@joalmeid @daltskin we need to decide if we are treating "airlock import/export" as a single feature or two for the initial iteration. I'm happy either way. I had them as one, but then split it as we were going to focus on export first, and it was a requirement for some workspaces to have one but not the other.
There is a lot of common ground, however it might be worth considering requirements first.
I think @mjbonifa's diagram above is useful. As a start I'm thinking each workspace has an inbox and outbox, only accessible by the Data Steward/Airlock Manager for that workspace?
@mjbonifa understand in some scenarios data will not move, and this is about granting access, but I feel the first iteration, based on previous work and customer requirements, should focus on the movement of artefacts. I also think we should focus on external locations as source and destination, while remembering that another workspace inbox as an external location should be considered down the line.
The next stage is to define how data is transferred from:
Previously we have done this with file shares and Storage Explorer - this might be acceptable for staging and download, but we need a more streamlined method with approvals.
@marrobi it makes sense for each workspace to have an inbox and outbox as you describe them, that construct allows us to work towards the above scenarios
Yes to remembering that another workspace inbox as an external location should be considered down the line - maybe we should break this out into another issue?
Another point is that the diagram above could be interpreted incorrectly. The workspaces for risk management could be anywhere within a dataflow depending on the scenario. These are not just at the boundary between the TRE and processes external to the TRE, as could be assumed.
Understand that a first version will copy data through the process for now
Two new features
Re dataflow, I think we need clearer use cases. It feels to me this is more of a data pipeline with defined activities, including approval, between each movement of data/change of data access. The workspace inbox/outbox would be stages in that pipeline.
There have been some previous experiences/attempts at this. Coming from a fresh perspective, and focusing on a first iteration for airlock processes, I believe in a few facts to start with:
This focuses mostly on data copying and not on providing data access
I posted this in 1109 but it makes more sense here. I have reviewed the stories for data ingress/egress and moving large data files around, and I think there may be a simpler solution to this problem. It does require some manual input but there are a lot of advantages.
1. "Researcher uploads data via specific workspace in the TRE web portal" - No, only the PI should be able to copy agreed data into (or out of) the Workspace. Only the PI has authority, so they should be the ones with the security access to do that. This creates a real-world airgap in the form of the PI. The inbox/outbox requirement just becomes a storage/security issue INSIDE the VM. The PI can add/delete files in the inbox but can only read from the outbox; the reverse is true for the researcher.
2. "Data is scanned for viruses (and potentially PII)" - I believe this can only be outside of the TRE, therefore done through standard virus-checking processes. Data must arrive on a UHS server before it arrives at the TRE - that would be the logical place to scan for viruses.
3. "Data import approver receives notification. They log onto a secure VM and can view the file, and the results of scans." - In the example below the approver is the PI and is the one controlling the data movement; it is the researcher that is notified when the PI copies the data into the workspace.
4. "Once approved, data gets moved to user storage in the workspace, and the researcher notified" - This step is not required because it is the PI who controls the data movement process, not the researcher, and it is accommodated in step 3.
This process solves the following:
The "airgap" is created by default and is a function of the security credentials of the PI. Unless the PI actively does something, data does not move.
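As a sketch (not actual TRE code), the inbox/outbox permissions described above can be captured as a small matrix; the role and location names are illustrative:

```python
# Illustrative permission matrix for the PI-as-airgap model: the PI can
# add/delete files in the inbox but only read from the outbox; the
# reverse is true for the researcher. Names are made up for this sketch.
PERMISSIONS = {
    ("pi", "inbox"): {"read", "write", "delete"},
    ("pi", "outbox"): {"read"},
    ("researcher", "inbox"): {"read"},
    ("researcher", "outbox"): {"read", "write", "delete"},
}


def is_allowed(role: str, location: str, action: str) -> bool:
    """Check whether a role may perform an action on a share location."""
    return action in PERMISSIONS.get((role, location), set())


print(is_allowed("pi", "inbox", "write"))          # True
print(is_allowed("researcher", "inbox", "write"))  # False
```

The asymmetry between the two roles is what creates the airgap: no single role can both write data in and read it back out of the same location.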
@CalMac-tns
If in the shared space, then appropriate governance rules will need to be created for the connections between workspace and shared services
If in the workspace the appropriate processes established to get the audit data and any other info out of a workspace. I'd suggest that the audit function is a separate issue and relates to many events related to the workspace. Hopefully there's a place to discuss that elsewhere.
Yes I agree. To bring the requirement for delegation into scope we would need to introduce the staging area back into the process; we can't delegate access to a file etc. if the rest of the TRE doesn't have visibility.
The process would be as follows:
The same process can then be reversed to get data out of the TRE, because the permissions are only shared between the PI/Delegate and the researcher.
The audit log is outside of the workspace but inside the TRE so that access can be gained for audit purposes, it should only contain metadata so no IG issues. The email would be a shared service but I'm open to suggestions on how notifications are handled.
Also agree that we need to decide what gets audited separately.
@CalMac-tns thanks for the input, a good discussion. I am conscious that we need to focus on requirements - from multiple sets of requirements - rather than specific implementation details at this point. There is also a bit of confusion around import vs export - we need to consider both flows.
@joalmeid is going to create some user stories from the various sets of requirements then we will work to create a technical architecture that meets this.
We've gone through a set of existing requirements, guidance from HDRUK, and current inputs on GH airlock import/export. Without looking into implementation details, there's a suggestion for the user stories of airlock export. The main goal is to fine-tune the user stories and finally add them to GH Issues.
Considerations:
Main Goals:
- Preventing data exfiltration
- Export research artefacts such as:
  ○ New data sets
  ○ Machine learning models
  ○ Reports
Envisioned high-level Workflow
```mermaid
graph TD
A(fa:fa-box-open Workspace Storage) --> |Export request created|B(Internal Storage)
B -->C{fa:fa-spinner Approval?}
C --> |Approved| D(External Storage)
C -.- |Rejected| E(fa:fa-ban Deleted);
```
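The workflow above can be modelled as a small state transition table. This is a hedged sketch with assumed state names; the actual airlock implementation may define different states and transitions:

```python
# Assumed state machine for an airlock export request, following the
# envisioned workflow: request created -> approval decision -> artefacts
# either reach external storage or get deleted. State names are made up.
VALID_TRANSITIONS = {
    "draft": {"submitted"},                  # export request created
    "submitted": {"approved", "rejected"},   # approval decision pending
    "approved": set(),                       # artefacts in external storage
    "rejected": {"deleted"},                 # artefacts get deleted
    "deleted": set(),
}


def transition(current: str, new: str) -> str:
    """Move a request to a new state, rejecting illegal transitions."""
    if new not in VALID_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new


state = transition("draft", "submitted")
state = transition(state, "approved")
print(state)  # approved
```

Encoding the transitions explicitly means an export request can never skip the approval gate, which is the core property the airlock needs.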
Draft User stories
As a Workspace TRE User/Researcher
I want to be able to upload a data file (zip) to the Workspace storage
So that the data is available to start the airlock export process

As a Workspace TRE User/Researcher
I want to be able to invoke the airlock export process
So that the data is available to start the airlock export process

As an automated process in TRE
I want to execute the gates defined in the TRE export process
So that I guarantee safety of the data
- Internal storage must be read-only

As a TRE Workspace Data Steward
I want to be able to see an airlock export process status by process id for a workspace
So that I can check the current overall export status and the results of all the export gates in the airlock process

As a TRE Workspace Data Steward
I want to be able to update the airlock export approval state
So that the airlock export process terminates and artefacts in the original request get deleted
- Approved artefacts move from the Internal storage to the External storage

As a TRE User/Researcher
I want to be able to access the artefacts in the External storage
So that I can copy or use them within my workspace
- External storage is only available to the requesting TRE User/Researcher
- External storage … for auditing purposes

As a TRE Admin
I want to be able to define if a new workspace has airlock export enabled
So that the Workspace Owner or TRE User/Researcher do or do not have access to the feature
Thanks for the input @joalmeid. Taking the story a step further, how would PII scanning actually work within the TRE?
How would we deal with situation where a PI (Principal Investigator) delegates authority to someone else? In the story above a delegate would be similar to a Data Steward but would only get the role once delegated from the PI?
Does the Data Steward assignment need to be considered for the scope of this story? These stories assume the Data Steward role has already been assigned to individual(s) for that workspace. The PII scanning could be an automated process (TBC) using a system account.
In the process we are working to, the Data Steward (aka Delegate) may not be known at the time of workspace creation; the PI will delegate the responsibility at a later date once the data sets have been received.
A user would be able to be assigned the Workspace Data Steward role at any point in time. Does that fit that need?
@marrobi Finally adding to the discussion here... I have started building Bicep templates to implement the flow found in the Architecture Center reference (https://docs.microsoft.com/en-us/azure/architecture/example-scenario/ai/secure-compute-for-research). This reference was written by one of my peers and it's commonly implemented in US EDU.
At this time, my templates are missing the Logic App to perform the approval. I am also reconsidering hosting the Azure Data Factory in the hub. It might just make sense to deploy the whole thing as a workspace service, instead of partially as a shared service, partially as a workspace service.
See the GitHub repo here: https://github.com/SvenAelterman/AzureTRE-ADF
I've also read something about virus scanning. Azure Storage has that built-in, but last time I checked, there's no event when the scan is finished and it can take hours. So I had previously developed an integration with VirusTotal, which can be found here: https://blog.aelterman.com/2021/02/21/on-demand-malware-scanning-for-azure-storage-blobs-with-virustotal/
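For illustration only, the hash-lookup part of such an integration might look like the sketch below. The VirusTotal v3 `files/{hash}` endpoint is real, but the helper names and surrounding flow are assumptions here, and the blog post's actual implementation (an event-driven Azure integration) differs:

```python
# Sketch of on-demand malware lookup: hash the blob's content locally
# and build the VirusTotal v3 report URL for that hash. Fetching the
# report would require a GET with an "x-apikey" header (not shown, to
# keep this self-contained and offline).
import hashlib

VT_FILES_ENDPOINT = "https://www.virustotal.com/api/v3/files/{}"


def sha256_of(data: bytes) -> str:
    """SHA-256 hex digest of the file content."""
    return hashlib.sha256(data).hexdigest()


def vt_report_url(data: bytes) -> str:
    """URL of the VirusTotal report for an already-known file hash."""
    return VT_FILES_ENDPOINT.format(sha256_of(data))


print(vt_report_url(b"hello"))
```

Looking up a hash rather than uploading the file keeps sensitive data inside the environment; only files whose hashes are unknown to VirusTotal would need any further handling.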
Thanks @SvenAelterman. We're trying to crystallize the requirements and flow that make sense in a TRE. We haven't broken it down into implementation yet, but I'm sure it will help. The airlock processes are definitely bound to a workspace. Regarding the malware scanning, we're also exploring other future Storage features we may use. Otherwise there is a similar implementation focused on Azure Functions, storage and Windows Defender in a VM. It ends up being quite similar to the VirusTotal one.
@SvenAelterman a thought - what if we had a Data Factory self-hosted integration runtime in the core/hub resource processor subnet? As this has access to all the workspace VNets (the resource processor has to carry out data plane operations), there would be no need to add managed private endpoints.
Also any outbound traffic can also be routed via the Azure firewall to prevent data exfiltration and for auditing purposes.
What do you think vs managed network?
@marrobi Hadn't thought about that yet. It could be a useful solution. At the same time, it's yet one more VM to manage. I am not sure how most customers would balance that.
@SvenAelterman it could maybe be run on a container instance as per https://docs.microsoft.com/en-us/azure/data-factory/how-to-run-self-hosted-integration-runtime-in-windows-container, but that doesn't support auto-update.
If a VM, it could likely be B-series, with auto updates of the OS and integration runtime.
I had worked on a simplified way of achieving ingress/egress as a short term solution on my fork here.
The solution uses two storage accounts that sit within a workspace: one is the current workspace SA which hosts the vm-shared-storage file share, and the other is an additional, public-facing one. Both have an "ingress" and an "egress" file share, with the workspace ingress/egress folders mounted to user resources the same way vm-shared-storage currently is.
On deploying the base workspace, an AD group is deployed, with the intention that workspace PIs would be added to the group to gain the required permissions to carry out ingress/egress actions.
Access to the public or "airlock" SA would be via Storage Explorer to upload/retrieve files, with a script leveraging the PI's permissions used to copy files between the two SAs (currently a bash script sitting in the /Scripts/ folder on the branch). There is an additional script which could do this from within a user resource, although this is not yet part of my branch.
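A rough sketch of what the copy step between the two storage accounts could look like if driven from Python rather than bash. The account names, share name, and SAS tokens are all made up for illustration, and `azcopy` would need to be on the PATH for the resulting command to actually run:

```python
# Hypothetical helper that builds an azcopy command to move a file from
# the public "airlock" SA into the workspace SA. Everything here is
# illustrative; the real script on the fork is a bash script.
def build_azcopy_copy(src_account: str, dst_account: str,
                      share: str, filename: str,
                      src_sas: str, dst_sas: str) -> list[str]:
    """Return the azcopy argv for a file-share-to-file-share copy."""
    src = f"https://{src_account}.file.core.windows.net/{share}/{filename}?{src_sas}"
    dst = f"https://{dst_account}.file.core.windows.net/{share}/{filename}?{dst_sas}"
    return ["azcopy", "copy", src, dst]


cmd = build_azcopy_copy("airlocksa", "workspacesa", "ingress",
                        "data.zip", "src-token", "dst-token")
print(" ".join(cmd))
```

Building the argv list (rather than a shell string) sidesteps quoting issues with SAS tokens, which contain `&` and `=` characters.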
This solution does not fully achieve what is intended for this feature although it may provide a starting point for what is to be produced.
design is done
Preventing data exfiltration is of absolute importance but there is a need to be able to export certain products of the work that have been done within the workspace such as ML models, new data sets to be pushed back to the data platform, reports, and similar artifacts.
A high level egress workflow would look like: