microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

Airlock - workspace data export (Design) #33

Closed (christoferlof closed this 2 years ago)

christoferlof commented 3 years ago

Preventing data exfiltration is of absolute importance, but there is a need to export certain products of the work done within the workspace, such as ML models, new data sets to be pushed back to the data platform, reports, and similar artifacts.

A high level egress workflow would look like:

  1. Researcher uploads (or links to?) data via the TRE web portal from within a workspace
  2. Data is scanned for viruses (and potentially PII)
  3. Data export approver receives notification. They log onto a secure VM and can view file, and results of scans.
  4. Once approved data gets moved to a staging location
  5. Data can be downloaded by the researcher via the portal from outside the workspace
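
For illustration only, here is a minimal sketch of the request lifecycle those five steps imply; the state names and transitions below are assumptions, not an implemented AzureTRE model.

```python
from enum import Enum

class ExportRequestState(Enum):
    DRAFT = "draft"                # researcher has created the request / uploaded data
    SCANNING = "scanning"          # virus / PII scans in progress
    PENDING_APPROVAL = "pending"   # waiting for the data export approver
    APPROVED = "approved"          # moved to the staging location
    REJECTED = "rejected"
    DOWNLOADED = "downloaded"      # researcher has retrieved the data outside the workspace

# Allowed transitions implied by the five steps above (illustrative only).
TRANSITIONS = {
    ExportRequestState.DRAFT: {ExportRequestState.SCANNING},
    ExportRequestState.SCANNING: {ExportRequestState.PENDING_APPROVAL, ExportRequestState.REJECTED},
    ExportRequestState.PENDING_APPROVAL: {ExportRequestState.APPROVED, ExportRequestState.REJECTED},
    ExportRequestState.APPROVED: {ExportRequestState.DOWNLOADED},
}

def advance(current: ExportRequestState, target: ExportRequestState) -> ExportRequestState:
    """Move a request to a new state, refusing transitions the workflow does not allow."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition {current.value} -> {target.value}")
    return target
```
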
marrobi commented 2 years ago

Let's consider the workflow identified here as an initial data export workflow: https://docs.microsoft.com/en-us/azure/architecture/example-scenario/ai/secure-compute-for-research.

mjbonifa commented 2 years ago

The dataflow described is a pipeline/data supply chain with governance (i.e. contracts, licensing, etc) controlling ingress and egress.

Staging inputs and outputs can be considered an activity in a pipeline with different roles, responsibilities and tools to those used to produce the artefacts themselves.

Now, the primary resource within a TRE that provides governance boundaries is the "workspace".

One idea would be to consider ways to connect/orchestrate workspaces, each with distinct ingress/egress policies, governed by and aligned with contracts/SLAs/agreements etc. as appropriate. For example:

Ingress Staging Workspace (linking, pseudonymisation, de-identification, etc) -> ML Workspace (artefact production) -> Egress Staging Workspace (as above)

By using the workspace construct to flexibly provision environment and data controls (in the context of the principles of the 5 safes) we can create assurances that the workspace meets the legal requirements, etc.

A pipeline of workspaces would also start work towards federation of TREs.

marrobi commented 2 years ago

Interesting way of thinking about it.

So everything is a data import "job" specified to a destination workspace.

Can all users request to send data to any workspace (maybe even in another TRE), or only to workspaces they are a member of?

There still needs to be an approval flow. Do we have a Data Reviewer role (say Workspace Owner for now), who can see all artefacts that are pending approval to get into their workspace, and simply move them from a transfer location into their workspace shared storage should they be "approved"?

How does the approver then get it to the Researcher's outside location, say a client machine? Is this a special case? Do they need to be able to download their approved exports?

mjbonifa commented 2 years ago

> So everything is a data import "job" specified to a destination workspace.

Yes, everything is a "process" or "job" as you say. This is useful for risk assessment and mitigation. It clearly identifies security zones, with the ability to implement "controls" between processes. This could cover aspects of data de-identification and environment configuration for downstream processes.

> Can all users request to send data to any workspace (maybe even in another TRE), or only to workspaces they are a member of?

That's a policy decision depending on the data rights, but it should be the "Workspace Owner"'s responsibility (or their delegate's). I think it's important to recognise that the flow of data is also a flow of ownership and rights that constrain downstream processing.

In terms of defining workspace connectivity, I'd suggest that a workspace owner would specify a whitelist of downstream workspaces that are allowed to request artefacts.

Workspace owners should not need to be members of downstream workspaces. I think it should be the other way around with downstream "Workspace Owners" or external processes requesting upstream artefacts (pull).
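
As a rough illustration of this pull model (the names and fields below are hypothetical), the owner's whitelist of downstream workspaces could gate which workspaces may even raise a request, with per-artefact approval remaining a separate human step:

```python
from dataclasses import dataclass

@dataclass
class ArtefactRequest:
    """A downstream workspace (or external process) asking to pull an upstream artefact."""
    artefact_id: str
    requesting_workspace: str
    requested_by: str  # downstream workspace owner

# Hypothetical upstream configuration: which downstream workspaces may request artefacts.
UPSTREAM_ALLOWLIST = {"ml-workspace", "egress-staging"}

def can_request(request: ArtefactRequest) -> bool:
    """The upstream workspace owner only entertains requests from whitelisted workspaces;
    approval of the individual artefact is still a separate, human decision."""
    return request.requesting_workspace in UPSTREAM_ALLOWLIST

req = ArtefactRequest("model-v3.pkl", "ml-workspace", "downstream-owner@example.org")
print(can_request(req))  # True -> continues to the approval step, not straight to transfer
```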

> There still needs to be an approval flow. Do we have a Data Reviewer role (say Workspace Owner for now), who can see all artefacts that are pending approval to get into their workspace, and simply move them from a transfer location into their workspace shared storage should they be "approved"?

Yes, the artefact requests will need to be approved, and it could be the workspace owner for workspace processes, or delegated to some "Data Reviewer".

If a workspace request is to an upstream service outside of the current TRE (e.g. NHS provider) it would follow the same sort of protocol. That's how it happens today.

I am concerned with the assumption of data being "moved/copied" through workspaces. We need to be careful about that. The assumption that data is copied through workspace processes may not be feasible for larger datasets. However, with a supply chain model it will be possible to deploy workspaces next to the data without moving the data to the workspace. There's also the whole world of data virtualisation options in relation to TREs and workspaces. What we should be doing is managing the flow of rights to access data, independent of the data tech, if possible.

> How does the approver then get it to the Researcher's outside location, say a client machine? Is this a special case? Do they need to be able to download their approved exports?

I'm not sure what you mean. Some workspaces, under certain conditions, should allow artefacts to be exported from the TRE.

We should also remember that the processing of data is undertaken in the context of a risk management process. Take a look at the diagram below, which outlines the trust, dependency and rights relationships between processes. In the example, we have:

What's important is the explicit injection of risk management into processes to allow for points of governance and stewardship, along with an overarching model of risk.

This is one flow and there are primitives (resource sharing, resource encapsulation, etc.) that we can consider and use to build trusted data supply chains.

(image: tre-workspace-processes)

marrobi commented 2 years ago

@joalmeid @daltskin we need to decide if we are treating "airlock import/export" as a single feature or two for the initial iteration. I'm happy either way; I had them as one, but then split it as we were going to focus on export first, and it was a requirement for some workspaces to have one but not the other.

There is a lot of common ground, however it might be worth considering requirements first.

I think @mjbonifa's diagram above is useful. As a start I'm thinking each workspace has an inbox and outbox, only accessible by the Data Steward/Airlock Manager for that workspace?

@mjbonifa I understand that in some scenarios data will not move, and this is about granting access, but I feel the first iteration, based on previous work and customer requirements, should focus on the movement of artefacts. I also think we should focus on external locations as source and destination, although remembering that another workspace inbox as an external location should be considered down the line.

The next stage is to define how data is transferred from:

Previously we have done this with file shares and Storage Explorer - it might be acceptable for staging and download, but we need a more streamlined method with approvals.
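
As one possible streamlining of the download leg (a sketch only, with placeholder account/container names), the portal could hand the researcher a short-lived, read-only SAS link to the approved artefact using the azure-storage-blob SDK:

```python
from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_blob_sas, BlobSasPermissions

def approved_download_url(account_name: str, account_key: str,
                          container: str, blob_name: str,
                          valid_hours: int = 4) -> str:
    """Return a read-only, time-limited URL for an approved export artefact."""
    sas = generate_blob_sas(
        account_name=account_name,
        container_name=container,
        blob_name=blob_name,
        account_key=account_key,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=valid_hours),
    )
    return f"https://{account_name}.blob.core.windows.net/{container}/{blob_name}?{sas}"

# Hypothetical usage once a request has been approved and moved to staging:
# url = approved_download_url("stgairlockexport", "<key>", "approved", "request-123/results.zip")
```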

mjbonifa commented 2 years ago

@marrobi it makes sense for each workspace to have an inbox and outbox as you describe them; that construct allows us to work towards the above scenarios.

Yes to "remembering that another workspace inbox as an external location should be considered down the line" - maybe we should break this out into another issue?

Another point is that the diagram above could be interpreted incorrectly. The workspaces for risk management could be anywhere within a dataflow depending on the scenario. These are not just at the boundary between the TRE and processes external to the TRE, as could be assumed.

Understood that a first version will copy data through the process for now.

marrobi commented 2 years ago

Two new features

Re dataflow, I think we need clearer use cases; it feels to me this is more of a data pipeline with defined activities, including approval, between each movement of data (or change of data access), with the workspace inbox/outbox being stages in that pipeline.

joalmeid commented 2 years ago

There have been some previous experiences/attempts at this. Coming from a fresh perspective, and focusing on a first iteration for the airlock processes, I believe in a few facts to start with:

This focuses mostly on data copying and not on providing data access.

CalMac-tns commented 2 years ago

I posted this in #1109 but it makes more sense here. I have reviewed the stories for data ingress/egress and moving large data files around, and I think there may be a simpler solution to this problem; it does require some manual input but there are a lot of advantages.

  1. "Researcher uploads data via specific workspace in the TRE web portal" - No, only the PI should be able to copy agreed data into (or out of) the Workspace. Only the PI has authority, so they should be the ones with the security access to do that. This creates a real-world airgap in the form of the PI. The inbox/outbox requirement just becomes a storage/security issue INSIDE the VM. The PI can add/delete files in the inbox but can only read from the outbox; the reverse is true for the researcher.
  2. "Data is scanned for viruses (and potentially PII)" - I believe this can only be done outside of the TRE, therefore through standard virus checking processes. Data must arrive on a UHS server before it arrives at the TRE; that would be the logical place to scan for viruses.
  3. "Data import approver receives notification. They log onto a secure VM and can view the file, and results of scans." - In the example below the approver is the PI and is the one controlling the data movement; it is the researcher that is notified when the PI copies the data into the workspace.
  4. "Once approved data gets moved to user storage in the workspace, and researcher notified" - This step is not required because it is the PI who controls the data movement process, not the researcher, and it is accommodated in step 3.

This process solves the following:

  1. Researchers cannot copy data in or out of the TRE.
  2. The PI is responsible for all data movements.
  3. Data is only copied into the TRE once approved - it's the PI's responsibility.
  4. All files that are copied into, created in, or deleted from the VM storage are automatically logged.
  5. Any size and any number of files can be dealt with using Storage Explorer.

The "airgap" is created by default and is a function of the security credentials of the PI. Unless the PI actively does something data does not move.

(image)
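
The inbox/outbox permissions described above (the PI writes the inbox and reads the outbox, with the researcher mirrored) reduce to a small matrix; the sketch below is purely illustrative:

```python
# Hypothetical permission matrix for the PI-as-airgap model described above.
# Keys: (role, share) -> set of allowed operations.
PERMISSIONS = {
    ("pi", "inbox"): {"add", "delete", "read"},
    ("pi", "outbox"): {"read"},
    ("researcher", "inbox"): {"read"},
    ("researcher", "outbox"): {"add", "delete", "read"},
}

def allowed(role: str, share: str, operation: str) -> bool:
    """True if the role may perform the operation on the given share inside the workspace VM."""
    return operation in PERMISSIONS.get((role, share), set())

assert allowed("pi", "inbox", "add")
assert not allowed("researcher", "inbox", "add")   # researchers cannot push data in
assert not allowed("pi", "outbox", "delete")       # the PI only reads what researchers stage out
```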

mjbonifa commented 2 years ago

@CalMac-tns

If in the shared space, then appropriate governance rules will need to be created for the connections between the workspace and shared services.

If in the workspace, then the appropriate processes need to be established to get the audit data and any other info out of the workspace. I'd suggest that the audit function is a separate issue and relates to many events related to the workspace; hopefully there's a place to discuss that elsewhere.

(image: data-import)

CalMac-tns commented 2 years ago

Yes, I agree. To bring the requirement for delegation into scope we would need to introduce the staging area back into the process; we can't delegate access to a file, etc., if the rest of the TRE doesn't have visibility.

The process would be as follows:

  1. PI or Owner receives the data outside of the TRE but within the scope of the UHS network (as they do now).
  2. The PI or Owner then copies the data to the staging area - only THEY can see this data at this point.
  3. They can continue as PI, or delegate; my idea is that delegation simply involves running a script such as `delegate files[] workspaceid email`, which changes access to the files in the staging area and creates the appropriate permissions within the workspace (a sketch of what such a script might record follows below).
  4. The delegate can then continue the process as the PI's delegate.

The same process can then be reversed to get data out of the TRE, because the permissions are only shared between the PI/delegate and the researcher.
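
A minimal sketch of what such a `delegate files[] workspaceid email` command could record, kept abstract on purpose (no real Azure role-assignment calls, and all names are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DelegationLog:
    """Metadata-only record of delegations (no data content, only what was delegated)."""
    entries: list = field(default_factory=list)

    def delegate(self, files: list[str], workspace_id: str, email: str) -> None:
        # In a real implementation this is where staging-area ACLs / workspace role
        # assignments would be updated; here we only record what was delegated.
        self.entries.append({
            "files": files,
            "workspace_id": workspace_id,
            "delegate": email,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

log = DelegationLog()
log.delegate(["cohort.csv", "codebook.xlsx"], "ws-001", "delegate@example.org")
print(log.entries[0]["delegate"])
```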

The audit log is outside of the workspace but inside the TRE, so that access can be gained for audit purposes; it should only contain metadata, so there are no IG issues. The email would be a shared service, but I'm open to suggestions on how notifications are handled.

Also agree that we need to decide what gets audited separately.

(image)

marrobi commented 2 years ago

@CalMac-tns thanks for the input, a good discussion. I am conscious that we need to focus on requirements - from multiple sets of requirements - rather than specific implementation details at this point. There is also a bit of confusion around import vs export - we need to consider both flows.

@joalmeid is going to create some user stories from the various sets of requirements then we will work to create a technical architecture that meets this.

joalmeid commented 2 years ago

We've gone through a set of existing requirements, guidance from HDRUK and current inputs on GH airlock import/export. Without taking any look into implementation details, here's a suggestion for the user stories for airlock export. The main goal is to fine-tune the user stories and finally add them to GH Issues.

Considerations:

Main goals:

- Preventing data exfiltration
- Export research artefacts such as:
  - New data sets
  - Machine learning models
  - Reports

Envisioned high-level Workflow

  1. Researcher requires artefacts to be exported and begins the airlock export process
  2. Data goes through a set of automated gates (e.g. scanning for PII, virus scan?, acceptance criteria - file types, size)
  3. Data export approver receives notification. They log onto a secure environment and can view the artefacts, and results of gates.
  4. Approved export artefacts get moved to a staging location
  5. Artefacts can be downloaded by the researcher from outside the workspace
graph TD
    A(fa:fa-box-open Workspace Storage) --> |Export request created|B(Internal Storage)
    B -->C{fa:fa-spinner Approval?}
    C --> |Approved| D(External Storage)
    C -.- |Rejected| E(fa:fa-ban Deleted);
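
The acceptance-criteria gate in step 2 could be as simple as a file-type and size check; the sketch below covers only that gate, with made-up limits, and leaves virus/PII scanning to external services:

```python
from pathlib import Path

# Illustrative acceptance criteria; real values would be TRE / workspace policy.
ALLOWED_EXTENSIONS = {".csv", ".parquet", ".pkl", ".onnx", ".pdf", ".zip"}
MAX_SIZE_BYTES = 2 * 1024**3  # 2 GiB

def acceptance_gate(path: Path) -> tuple[bool, str]:
    """First automated gate: reject obviously out-of-policy artefacts before human review."""
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        return False, f"file type {path.suffix!r} not allowed for export"
    if path.stat().st_size > MAX_SIZE_BYTES:
        return False, "artefact exceeds maximum export size"
    return True, "passed acceptance criteria"
```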

Draft User stories

As a Workspace TRE User/Researcher I want to be able to upload a data file (zip) to the Workspace storage So that the data is available to start the airlock export process

As a Workspace TRE User/Researcher I want to be able to invoke the airlock export process So that the data is available to start the airlock export process

As an automated process in TRE I want to execute the gates defined in the TRE export process So that I guarantee safety of the data

As a TRE Workspace Data Steward I want to be able to see an airlock export process status by process id for a workspace So that I can check the current overall export status and results of all the export gates in the airlock process.

As a TRE Workspace Data Steward I want to be able to update the airlock export approval state So that the airlock export process terminates and artefacts in the original request get deleted.

As a TRE User/Researcher I want to be able to access the artefacts in the External storage So that I can copy or use them within my workspace

As a TRE Admin I want to be able to define if a new workspace has airlock export enabled So that the Workspace Owner or TRE User/Researcher does or does not have access to the feature
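
For the last story, the enable/disable switch could simply be a workspace property checked before any export request is created; the property name below is an assumption, not the actual AzureTRE workspace schema:

```python
# Hypothetical workspace properties as a TRE Admin might set them at creation time.
workspace_properties = {
    "ws-research-01": {"airlock_export_enabled": True},
    "ws-teaching-02": {"airlock_export_enabled": False},
}

def create_export_request(workspace_id: str, artefact: str) -> str:
    """Refuse to start the airlock export process if the feature is disabled for the workspace."""
    props = workspace_properties.get(workspace_id, {})
    if not props.get("airlock_export_enabled", False):
        raise PermissionError(f"Airlock export is not enabled for {workspace_id}")
    return f"export request created for {artefact} in {workspace_id}"

print(create_export_request("ws-research-01", "model.onnx"))
```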

CalMac-tns commented 2 years ago

Thanks for the input @joalmeid. Taking the story a step further, how would PII scanning actually work within the TRE?

How would we deal with the situation where a PI (Principal Investigator) delegates authority to someone else? In the story above, a delegate would be similar to a Data Steward, but would only get the role once it is delegated from the PI?

daltskin commented 2 years ago

Does the Data Steward assignment need to be considered in the scope of this story? These stories assume the Data Steward role has already been assigned to individual(s) for that workspace. The PII scanning could be an automated process (TBC) using a system account.

CalMac-tns commented 2 years ago

In the process we are working to, the Data Steward (aka Delegate) may not be known at the time of workspace creation; the PI will delegate the responsibility at a later date once the data sets have been received.

marrobi commented 2 years ago

A user would be able to be assigned the Workspace Data Steward role at any point in time. Does that fit that need?

SvenAelterman commented 2 years ago

@marrobi Finally adding to the discussion here... I have started building Bicep templates to implement the flow found in the Architecture Center reference (https://docs.microsoft.com/en-us/azure/architecture/example-scenario/ai/secure-compute-for-research). This reference was written by one of my peers and it's commonly implemented in US EDU.

At this time, my templates are missing the Logic App to perform the approval. I am also reconsidering hosting the Azure Data Factory in the hub. It might just make sense to deploy the whole thing as a workspace service, instead of partially as a shared service and partially as a workspace service.

See the GitHub repo here: https://github.com/SvenAelterman/AzureTRE-ADF

I've also read something about virus scanning. Azure Storage has that built-in, but last time I checked, there's no event when the scan is finished and it can take hours. So I had previously developed an integration with VirusTotal, which can be found here: https://blog.aelterman.com/2021/02/21/on-demand-malware-scanning-for-azure-storage-blobs-with-virustotal/
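
Not @SvenAelterman's implementation, but for context, a rough sketch of the kind of on-demand check described, using the public VirusTotal v3 file-hash lookup (assumes the `requests` package and an API key; an unknown hash would be uploaded for analysis in a real workflow):

```python
import hashlib
import requests

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def virustotal_verdict(file_bytes: bytes, api_key: str) -> str:
    """Look up the file's hash in VirusTotal (v3 API); 'unknown' means it has never been seen."""
    resp = requests.get(
        f"https://www.virustotal.com/api/v3/files/{sha256_of(file_bytes)}",
        headers={"x-apikey": api_key},
        timeout=30,
    )
    if resp.status_code == 404:
        return "unknown"
    resp.raise_for_status()
    stats = resp.json()["data"]["attributes"]["last_analysis_stats"]
    return "malicious" if stats.get("malicious", 0) > 0 else "clean"
```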

joalmeid commented 2 years ago

Thanks @SvenAelterman. We're trying to crystallize the requirements and flow that make sense in a TRE. We haven't broken it down into implementation yet, but I'm sure it will help. The airlock processes are definitely bound to a workspace. Regarding the malware scanning, we're also exploring other future Storage features we may use. Otherwise there is a similar implementation focused on Azure Functions, Storage and Windows Defender in a VM. It ends up being quite similar to the VirusTotal one.

marrobi commented 2 years ago

@SvenAelterman a thought - what if we had a Data Factory self-hosted integration runtime in the core/hub resource processor subnet? This subnet has access to all the workspace VNets (as the resource processor has to carry out data plane operations), so there would be no need to add managed private endpoints.

Also, any outbound traffic can be routed via the Azure Firewall to prevent data exfiltration and for auditing purposes.

What do you think vs managed network?

SvenAelterman commented 2 years ago

@marrobi Hadn't thought about that yet. It could be a useful solution. At the same time, it's yet one more VM to manage. I am not sure how most customers would balance that.

marrobi commented 2 years ago

@SvenAelterman it could maybe be run on a container instance, as per https://docs.microsoft.com/en-us/azure/data-factory/how-to-run-self-hosted-integration-runtime-in-windows-container, but that doesn't support auto update.

If a VM, it could likely be B-series, with auto updates of the OS and integration runtime.

oliver7598 commented 2 years ago

I had worked on a simplified way of achieving ingress/egress as a short term solution on my fork here.

The solution uses two storage accounts that sit within a workspace: one is the existing workspace SA, which currently hosts the vm-shared-storage file share, and the other is an additional, public-facing SA. Both have an "ingress" and an "egress" file share, with the workspace ingress/egress folders being mounted to user resources the same way vm-shared-storage currently is.

On deploying the base workspace, an AD group is deployed with the intention that workspace PIs would be added to the group to gain the required permissions to carry out ingress/egress actions.

Access to the public or "airlock" SA would be via Storage Explorer to upload/retrieve files, with a script leveraging the PI's permissions used to copy files between the two SAs (currently a bash script sitting in the /Scripts/ folder on the branch). There is an additional script which could do this from within a user resource, although this is not yet part of my branch.
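
The copy between the two storage accounts could equally be done server-side; below is a rough Python sketch (blob-based, whereas the fork uses file shares, and all names/URLs are placeholders) using `start_copy_from_url` from the azure-storage-blob SDK:

```python
from azure.storage.blob import BlobClient

def copy_between_accounts(source_blob_sas_url: str,
                          dest_account_url: str, dest_container: str, dest_name: str,
                          dest_credential) -> None:
    """Start a server-side copy from the workspace storage account to the public-facing
    'airlock' account (or vice versa); the source URL must carry read permission (e.g. SAS)."""
    dest = BlobClient(account_url=dest_account_url,
                      container_name=dest_container,
                      blob_name=dest_name,
                      credential=dest_credential)
    dest.start_copy_from_url(source_blob_sas_url)

# Hypothetical usage by a PI with access to both accounts:
# copy_between_accounts(
#     "https://stgworkspace.blob.core.windows.net/egress/results.zip?<sas>",
#     "https://stgairlock.blob.core.windows.net", "egress", "results.zip",
#     dest_credential="<airlock-account-key-or-sas>",
# )
```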

This solution does not fully achieve what is intended for this feature, although it may provide a starting point for what is to be produced.

eladiw commented 2 years ago

design is done