microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

Airlock - workspace data export (Design) #33

Closed (christoferlof closed this 2 years ago)

christoferlof commented 3 years ago

Preventing data exfiltration is of absolute importance, but there is a need to export certain products of the work done within the workspace, such as ML models, new data sets to be pushed back to the data platform, reports, and similar artifacts.

A high level egress workflow would look like:

  1. Researcher uploads (or links to?) data via the TRE web portal from within a workspace
  2. Data is scanned for viruses (and potentially PII)
  3. Data export approver receives notification. They log onto a secure VM and can view file, and results of scans.
  4. Once approved data gets moved to a staging location
  5. Data can be downloaded by the researcher via the portal from outside the workspace
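
For illustration only, here is a minimal sketch of the request lifecycle those five steps imply; the state names and transitions below are assumptions, not an implemented AzureTRE model.

```python
from enum import Enum

class ExportRequestState(Enum):
    DRAFT = "draft"                # researcher has created the request / uploaded data
    SCANNING = "scanning"          # virus / PII scans in progress
    PENDING_APPROVAL = "pending"   # waiting for the data export approver
    APPROVED = "approved"          # moved to the staging location
    REJECTED = "rejected"
    DOWNLOADED = "downloaded"      # researcher has retrieved the data outside the workspace

# Allowed transitions implied by the five steps above (illustrative only).
TRANSITIONS = {
    ExportRequestState.DRAFT: {ExportRequestState.SCANNING},
    ExportRequestState.SCANNING: {ExportRequestState.PENDING_APPROVAL, ExportRequestState.REJECTED},
    ExportRequestState.PENDING_APPROVAL: {ExportRequestState.APPROVED, ExportRequestState.REJECTED},
    ExportRequestState.APPROVED: {ExportRequestState.DOWNLOADED},
}

def advance(current: ExportRequestState, target: ExportRequestState) -> ExportRequestState:
    """Move a request to a new state, refusing transitions the workflow does not allow."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition {current.value} -> {target.value}")
    return target
```
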
marrobi commented 2 years ago

Let's consider the workflow identified here as an initial data export workflow: https://docs.microsoft.com/en-us/azure/architecture/example-scenario/ai/secure-compute-for-research.

mjbonifa commented 2 years ago

The dataflow described is a pipeline/data supply chain with governance (i.e. contracts, licensing, etc) controlling ingress and egress.

Staging inputs and outputs can be considered an activity in a pipeline with different roles, responsibilities and tools to those used to produce the artefacts themselves.

Now, the primary resource within a TRE that provides governance boundaries is the "workspace".

One idea would be to consider ways to connect/orchestrate workspaces, each with distinct ingress/egress policies, governed by and aligned with contracts/SLAs/agreements etc. as appropriate. For example:

Ingress Staging Workspace (linking, pseudonymisation, de-identification, etc) -> ML Workspace (artefact production) -> Egress Staging Workspace (as above)

By using the workspace construct to flexibly provision environment and data controls (in the context of the principles of the 5 safes) we can create assurances that the workspace meets the legal requirements, etc.

A pipeline of workspaces would also start work towards federation of TREs.

marrobi commented 2 years ago

Interesting way of thinking about it.

So everything is a data import "job" specified to a destination workspace.

Can all users request to send data to any workspace (maybe even in another TRE), or only to workspaces they are a member of?

There still needs to be an approval flow. Do we have a Data Reviewer role (say Workspace Owner for now), who can see all artefacts that are pending approval to get into their workspace, and simply move them from a transfer location into their workspace shared storage should they be "approved"?

How does the approver then get it to the Researcher's outside location, say a client machine? Is this a special case? Do they need to be able to download their approved exports?

mjbonifa commented 2 years ago

> So everything is a data import "job" specified to a destination workspace.

Yes, everything is a "process" or "job" as you say. This is useful for risk assessment and mitigation. It clearly identifies security zones, with the ability to implement "controls" between processes. This could cover aspects of data de-identification and environment configuration for downstream processes.

> Can all users request to send data to any workspace (maybe even in another TRE), or only to workspaces they are a member of?

That's a policy decision depending on the data rights, but it should be the "Workspace Owner"'s responsibility (or their delegate's). I think it's important to recognise that the flow of data is also a flow of ownership and rights that constrain downstream processing.

In terms of defining workspace connectivity, I'd suggest that a workspace owner would specify a whitelist of downstream workspaces that are allowed to request artefacts.

Workspace owners should not need to be members of downstream workspaces. I think it should be the other way around with downstream "Workspace Owners" or external processes requesting upstream artefacts (pull).
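
As a rough illustration of this pull model (the names and fields below are hypothetical), the owner's whitelist of downstream workspaces could gate which workspaces may even raise a request, with per-artefact approval remaining a separate human step:

```python
from dataclasses import dataclass

@dataclass
class ArtefactRequest:
    """A downstream workspace (or external process) asking to pull an upstream artefact."""
    artefact_id: str
    requesting_workspace: str
    requested_by: str  # downstream workspace owner

# Hypothetical upstream configuration: which downstream workspaces may request artefacts.
UPSTREAM_ALLOWLIST = {"ml-workspace", "egress-staging"}

def can_request(request: ArtefactRequest) -> bool:
    """The upstream workspace owner only entertains requests from whitelisted workspaces;
    approval of the individual artefact is still a separate, human decision."""
    return request.requesting_workspace in UPSTREAM_ALLOWLIST

req = ArtefactRequest("model-v3.pkl", "ml-workspace", "downstream-owner@example.org")
print(can_request(req))  # True -> continues to the approval step, not straight to transfer
```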

> There still needs to be an approval flow. Do we have a Data Reviewer role (say Workspace Owner for now), who can see all artefacts that are pending approval to get into their workspace, and simply move them from a transfer location into their workspace shared storage should they be "approved"?

Yes, the artefact requests will need to be approved, and it could be the workspace owner for workspace processes, or delegated to some "Data Reviewer".

If a workspace request is to an upstream service outside of the current TRE (e.g. NHS provider) it would follow the same sort of protocol. That's how it happens today.

I am concerned with the assumption of data being "moved/copied" through workspaces. We need to be careful about that. The assumption that data is copied through workspace processes may not be feasible for larger datasets. However, with a supply chain model it will be possible to deploy workspaces next to the data without moving the data to the workspace. There's also the whole world of data virtualisation options in relation to TREs and workspaces. What we should be doing is managing the flow of rights to access data, independent of the data tech, if possible.

> How does the approver then get it to the Researcher's outside location, say a client machine? Is this a special case? Do they need to be able to download their approved exports?

I'm not sure what you mean. Some workspaces, under certain conditions, should allow artefacts to be exported from the TRE.

We should also remember that the processing of data is undertaken in the context of a risk management process. Take a look at the diagram below, which outlines the trust, dependency and rights relationships between processes. In the example, we have:

What's important is the explicit injection of risk management into processes to allow for points of governance and stewardship, along with an overarching model of risk.

This is one flow and there are primitives (resource sharing, resource encapsulation, etc.) that we can consider and use to build trusted data supply chains.

(image: tre-workspace-processes)

marrobi commented 2 years ago

@joalmeid @daltskin we need to decide if we are treating "airlock import/export" as a single feature or two for the initial iteration. I'm happy either way; I had them as one, but then split it as we were going to focus on export first, and it was a requirement for some workspaces to have one but not the other.

There is a lot of common ground, however it might be worth considering requirements first.

I think @mjbonifa's diagram above is useful. As a start I'm thinking each workspace has an inbox and outbox, only accessible by the Data Steward/Airlock Manager for that workspace?

@mjbonifa I understand that in some scenarios data will not move, and this is about granting access, but I feel the first iteration, based on previous work and customer requirements, should focus on the movement of artefacts. I also think we should focus on external locations as source and destination, although remembering that another workspace inbox as an external location should be considered down the line.

The next stage is to define how data is transferred from:

Previously we have done this with file shares and Storage Explorer - it might be acceptable for staging and download, but we need a more streamlined method with approvals.
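
As one possible streamlining of the download leg (a sketch only, with placeholder account/container names), the portal could hand the researcher a short-lived, read-only SAS link to the approved artefact using the azure-storage-blob SDK:

```python
from datetime import datetime, timedelta, timezone
from azure.storage.blob import generate_blob_sas, BlobSasPermissions

def approved_download_url(account_name: str, account_key: str,
                          container: str, blob_name: str,
                          valid_hours: int = 4) -> str:
    """Return a read-only, time-limited URL for an approved export artefact."""
    sas = generate_blob_sas(
        account_name=account_name,
        container_name=container,
        blob_name=blob_name,
        account_key=account_key,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=valid_hours),
    )
    return f"https://{account_name}.blob.core.windows.net/{container}/{blob_name}?{sas}"

# Hypothetical usage once a request has been approved and moved to staging:
# url = approved_download_url("stgairlockexport", "<key>", "approved", "request-123/results.zip")
```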

mjbonifa commented 2 years ago

@marrobi it makes sense for each workspace to have an inbox and outbox as you describe them; that construct allows us to work towards the above scenarios.

Yes to "remembering that another workspace inbox as an external location should be considered down the line" - maybe we should break this out into another issue?

Another point is that the diagram above could be interpreted incorrectly. The workspaces for risk management could be anywhere within a dataflow depending on the scenario. These are not just at the boundary between the TRE and processes external to the TRE, as could be assumed.

Understood that a first version will copy data through the process for now.

marrobi commented 2 years ago

Two new features

Re dataflow, I think we need clearer use cases; it feels to me this is more of a data pipeline with defined activities, including approval, between each movement of data (or change of data access), with the workspace inbox/outbox being stages in that pipeline.

joalmeid commented 2 years ago

There have been some previous experiences/attempts at this. Coming from a fresh perspective, and focusing on a first iteration for the airlock processes, I believe in a few facts to start with:

This focuses mostly on data copying and not on providing data access.

CalMac-tns commented 2 years ago

I posted this in #1109 but it makes more sense here. I have reviewed the stories for data ingress/egress and moving large data files around, and I think there may be a simpler solution to this problem; it does require some manual input but there are a lot of advantages.

  1. "Researcher uploads data via specific workspace in the TRE web portal" - No, only the PI should be able to copy agreed data into (or out of) the Workspace. Only the PI has authority, so they should be the ones with the security access to do that. This creates a real-world airgap in the form of the PI. The inbox/outbox requirement just becomes a storage/security issue INSIDE the VM. The PI can add/delete files in the inbox but can only read from the outbox; the reverse is true for the researcher.
  2. "Data is scanned for viruses (and potentially PII)" - I believe this can only be done outside of the TRE, therefore through standard virus checking processes. Data must arrive on a UHS server before it arrives at the TRE; that would be the logical place to scan for viruses.
  3. "Data import approver receives notification. They log onto a secure VM and can view the file, and results of scans." - In the example below the approver is the PI and is the one controlling the data movement; it is the researcher that is notified when the PI copies the data into the workspace.
  4. "Once approved data gets moved to user storage in the workspace, and researcher notified" - This step is not required because it is the PI who controls the data movement process, not the researcher, and it is accommodated in step 3.

This process solves the following:

  1. Researchers cannot copy data in or out of the TRE.
  2. The PI is responsible for all data movements.
  3. Data is only copied into the TRE once approved - it's the PI's responsibility.
  4. All files that are copied into, created in, or deleted from the VM storage are automatically logged.
  5. Any size and any number of files can be dealt with using Storage Explorer.

The "airgap" is created by default and is a function of the security credentials of the PI. Unless the PI actively does something data does not move.

(image)
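
The inbox/outbox permissions described above (the PI writes the inbox and reads the outbox, with the researcher mirrored) reduce to a small matrix; the sketch below is purely illustrative:

```python
# Hypothetical permission matrix for the PI-as-airgap model described above.
# Keys: (role, share) -> set of allowed operations.
PERMISSIONS = {
    ("pi", "inbox"): {"add", "delete", "read"},
    ("pi", "outbox"): {"read"},
    ("researcher", "inbox"): {"read"},
    ("researcher", "outbox"): {"add", "delete", "read"},
}

def allowed(role: str, share: str, operation: str) -> bool:
    """True if the role may perform the operation on the given share inside the workspace VM."""
    return operation in PERMISSIONS.get((role, share), set())

assert allowed("pi", "inbox", "add")
assert not allowed("researcher", "inbox", "add")   # researchers cannot push data in
assert not allowed("pi", "outbox", "delete")       # the PI only reads what researchers stage out
```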

mjbonifa commented 2 years ago

@CalMac-tns

If in the shared space, then appropriate governance rules will need to be created for the connections between the workspace and shared services.

If in the workspace, then the appropriate processes need to be established to get the audit data and any other info out of the workspace. I'd suggest that the audit function is a separate issue and relates to many events related to the workspace; hopefully there's a place to discuss that elsewhere.

(image: data-import)

CalMac-tns commented 2 years ago

Yes, I agree. To bring the requirement for delegation into scope we would need to introduce the staging area back into the process; we can't delegate access to a file, etc., if the rest of the TRE doesn't have visibility.

The process would be as follows:

  1. PI or Owner receives the data outside of the TRE but within the scope of the UHS network (as they do now).
  2. The PI or Owner then copies the data to the staging area - only THEY can see this data at this point.
  3. They can continue as PI, or delegate; my idea is that delegation simply involves running a script such as `delegate files[] workspaceid email`, which changes access to the files in the staging area and creates the appropriate permissions within the workspace (a sketch of what such a script might record follows below).
  4. The delegate can then continue the process as the PI's delegate.

The same process can then be reversed to get data out of the TRE, because the permissions are only shared between the PI/delegate and the researcher.
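
A minimal sketch of what such a `delegate files[] workspaceid email` command could record, kept abstract on purpose (no real Azure role-assignment calls, and all names are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DelegationLog:
    """Metadata-only record of delegations (no data content, only what was delegated)."""
    entries: list = field(default_factory=list)

    def delegate(self, files: list[str], workspace_id: str, email: str) -> None:
        # In a real implementation this is where staging-area ACLs / workspace role
        # assignments would be updated; here we only record what was delegated.
        self.entries.append({
            "files": files,
            "workspace_id": workspace_id,
            "delegate": email,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

log = DelegationLog()
log.delegate(["cohort.csv", "codebook.xlsx"], "ws-001", "delegate@example.org")
print(log.entries[0]["delegate"])
```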

The audit log is outside of the workspace but inside the TRE, so that access can be gained for audit purposes; it should only contain metadata, so there are no IG issues. The email would be a shared service, but I'm open to suggestions on how notifications are handled.

Also agree that we need to decide what gets audited separately.

(image)

marrobi commented 2 years ago

@CalMac-tns thanks for the input, a good discussion. I am conscious that we need to focus on requirements - from multiple sets of requirements - rather than specific implementation details at this point. There is also a bit of confusion around import vs export - we need to consider both flows.

@joalmeid is going to create some user stories from the various sets of requirements then we will work to create a technical architecture that meets this.

joalmeid commented 2 years ago

We've gone through a set of existing requirements, guidance from HDRUK and current inputs on GH airlock import/export. Without taking any look into implementation details, here's a suggestion for the user stories for airlock export. The main goal is to fine-tune the user stories and finally add them to GH Issues.

Considerations:

Main goals:

- Preventing data exfiltration
- Export research artefacts such as:
  - New data sets
  - Machine learning models
  - Reports

Envisioned high-level Workflow

  1. Researcher requires artefacts to be exported and begins the airlock export process
  2. Data goes through a set of automated gates (e.g. scanning for PII, virus scan?, acceptance criteria - file types, size)
  3. Data export approver receives notification. They log onto a secure environment and can view the artefacts, and results of gates.
  4. Approved export artefacts get moved to a staging location
  5. Artefacts can be downloaded by the researcher from outside the workspace
graph TD
    A(fa:fa-box-open Workspace Storage) --> |Export request created|B(Internal Storage)
    B -->C{fa:fa-spinner Approval?}
    C --> |Approved| D(External Storage)
    C -.- |Rejected| E(fa:fa-ban Deleted);
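
The acceptance-criteria gate in step 2 could be as simple as a file-type and size check; the sketch below covers only that gate, with made-up limits, and leaves virus/PII scanning to external services:

```python
from pathlib import Path

# Illustrative acceptance criteria; real values would be TRE / workspace policy.
ALLOWED_EXTENSIONS = {".csv", ".parquet", ".pkl", ".onnx", ".pdf", ".zip"}
MAX_SIZE_BYTES = 2 * 1024**3  # 2 GiB

def acceptance_gate(path: Path) -> tuple[bool, str]:
    """First automated gate: reject obviously out-of-policy artefacts before human review."""
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        return False, f"file type {path.suffix!r} not allowed for export"
    if path.stat().st_size > MAX_SIZE_BYTES:
        return False, "artefact exceeds maximum export size"
    return True, "passed acceptance criteria"
```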

Draft User stories

As a Workspace TRE User/Researcher I want to be able to upload a data file (zip) to the Workspace storage So that the data is available to start the airlock export process

As a Workspace TRE User/Researcher I want to be able to invoke the airlock export process So that the data is available to start the airlock export process

As an automated process in TRE I want to execute the gates defined in the TRE export process So that I guarantee safety of the data

As a TRE Workspace Data Steward I want to be able to see an airlock export process status by process id for a workspace So that I can check the current overall export status and results of all the export gates in the airlock process.

As a TRE Workspace Data Steward I want to be able to update the airlock export approval state So that the airlock export process terminates and artefacts in the original request get deleted.

As a TRE User/Researcher I want to be able to access the artefacts in the External storage So that I can copy or use them within my workspace

As a TRE Admin I want to be able to define if a new workspace has airlock export enabled So that the Workspace Owner or TRE User/Researcher does or does not have access to the feature
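
For the last story, the enable/disable switch could simply be a workspace property checked before any export request is created; the property name below is an assumption, not the actual AzureTRE workspace schema:

```python
# Hypothetical workspace properties as a TRE Admin might set them at creation time.
workspace_properties = {
    "ws-research-01": {"airlock_export_enabled": True},
    "ws-teaching-02": {"airlock_export_enabled": False},
}

def create_export_request(workspace_id: str, artefact: str) -> str:
    """Refuse to start the airlock export process if the feature is disabled for the workspace."""
    props = workspace_properties.get(workspace_id, {})
    if not props.get("airlock_export_enabled", False):
        raise PermissionError(f"Airlock export is not enabled for {workspace_id}")
    return f"export request created for {artefact} in {workspace_id}"

print(create_export_request("ws-research-01", "model.onnx"))
```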

CalMac-tns commented 2 years ago

Thanks for the input @joalmeid. Taking the story a step further, how would PII scanning actually work within the TRE?

How would we deal with the situation where a PI (Principal Investigator) delegates authority to someone else? In the story above, a delegate would be similar to a Data Steward, but would only get the role once it is delegated from the PI?

daltskin commented 2 years ago

Does the Data Steward assignment need to be considered in the scope of this story? These stories assume the Data Steward role has already been assigned to individual(s) for that workspace. The PII scanning could be an automated process (TBC) using a system account.

CalMac-tns commented 2 years ago

In the process we are working to, the Data Steward (aka Delegate) may not be known at the time of workspace creation; the PI will delegate the responsibility at a later date once the data sets have been received.

marrobi commented 2 years ago

A user would be able to be assigned the Workspace Data Steward role at any point in time. Does that fit that need?

SvenAelterman commented 2 years ago

@marrobi Finally adding to the discussion here... I have started building Bicep templates to implement the flow found in the Architecture Center reference (https://docs.microsoft.com/en-us/azure/architecture/example-scenario/ai/secure-compute-for-research). This reference was written by one of my peers and it's commonly implemented in US EDU.

At this time, my templates are missing the Logic App to perform the approval. I am also reconsidering hosting the Azure Data Factory in the hub. It might just make sense to deploy the whole thing as a workspace service, instead of partially as a shared service and partially as a workspace service.

See the GitHub repo here: https://github.com/SvenAelterman/AzureTRE-ADF

I've also read something about virus scanning. Azure Storage has that built-in, but last time I checked, there's no event when the scan is finished and it can take hours. So I had previously developed an integration with VirusTotal, which can be found here: https://blog.aelterman.com/2021/02/21/on-demand-malware-scanning-for-azure-storage-blobs-with-virustotal/
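
Not @SvenAelterman's implementation, but for context, a rough sketch of the kind of on-demand check described, using the public VirusTotal v3 file-hash lookup (assumes the `requests` package and an API key; an unknown hash would be uploaded for analysis in a real workflow):

```python
import hashlib
import requests

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def virustotal_verdict(file_bytes: bytes, api_key: str) -> str:
    """Look up the file's hash in VirusTotal (v3 API); 'unknown' means it has never been seen."""
    resp = requests.get(
        f"https://www.virustotal.com/api/v3/files/{sha256_of(file_bytes)}",
        headers={"x-apikey": api_key},
        timeout=30,
    )
    if resp.status_code == 404:
        return "unknown"
    resp.raise_for_status()
    stats = resp.json()["data"]["attributes"]["last_analysis_stats"]
    return "malicious" if stats.get("malicious", 0) > 0 else "clean"
```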

joalmeid commented 2 years ago

Thanks @SvenAelterman. We're trying to crystallize the requirements and flow that make sense in a TRE. We haven't broken it down into implementation yet, but I'm sure it will help. The airlock processes are definitely bound to a workspace. Regarding the malware scanning, we're also exploring other future Storage features we may use. Otherwise there is a similar implementation focused on Azure Functions, Storage and Windows Defender in a VM. It ends up being quite similar to the VirusTotal one.

marrobi commented 2 years ago

@SvenAelterman a thought - what if we had a Data Factory self-hosted integration runtime in the core/hub resource processor subnet? This subnet has access to all the workspace VNets (as the resource processor has to carry out data plane operations), so there would be no need to add managed private endpoints.

Also, any outbound traffic can be routed via the Azure Firewall to prevent data exfiltration and for auditing purposes.

What do you think vs managed network?

SvenAelterman commented 2 years ago

@marrobi Hadn't thought about that yet. It could be a useful solution. At the same time, it's yet one more VM to manage. I am not sure how most customers would balance that.

marrobi commented 2 years ago

@SvenAelterman it could maybe be run on a container instance, as per https://docs.microsoft.com/en-us/azure/data-factory/how-to-run-self-hosted-integration-runtime-in-windows-container, but that doesn't support auto update.

If a VM, it could likely be B-series, with auto updates of the OS and integration runtime.

oliver7598 commented 2 years ago

I had worked on a simplified way of achieving ingress/egress as a short term solution on my fork here.

The solution uses two storage accounts that sit within a workspace: one is the existing workspace SA, which currently hosts the vm-shared-storage file share, and the other is an additional, public-facing SA. Both have an "ingress" and an "egress" file share, with the workspace ingress/egress folders being mounted to user resources the same way vm-shared-storage currently is.

On deploying the base workspace, an AD group is deployed with the intention that workspace PIs would be added to the group to gain the required permissions to carry out ingress/egress actions.

Access to the public or "airlock" SA would be via Storage Explorer to upload/retrieve files, with a script leveraging the PI's permissions used to copy files between the two SAs (currently a bash script sitting in the /Scripts/ folder on the branch). There is an additional script which could do this from within a user resource, although this is not yet part of my branch.
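
The copy between the two storage accounts could equally be done server-side; below is a rough Python sketch (blob-based, whereas the fork uses file shares, and all names/URLs are placeholders) using `start_copy_from_url` from the azure-storage-blob SDK:

```python
from azure.storage.blob import BlobClient

def copy_between_accounts(source_blob_sas_url: str,
                          dest_account_url: str, dest_container: str, dest_name: str,
                          dest_credential) -> None:
    """Start a server-side copy from the workspace storage account to the public-facing
    'airlock' account (or vice versa); the source URL must carry read permission (e.g. SAS)."""
    dest = BlobClient(account_url=dest_account_url,
                      container_name=dest_container,
                      blob_name=dest_name,
                      credential=dest_credential)
    dest.start_copy_from_url(source_blob_sas_url)

# Hypothetical usage by a PI with access to both accounts:
# copy_between_accounts(
#     "https://stgworkspace.blob.core.windows.net/egress/results.zip?<sas>",
#     "https://stgairlock.blob.core.windows.net", "egress", "results.zip",
#     dest_credential="<airlock-account-key-or-sas>",
# )
```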

This solution does not fully achieve what is intended for this feature, although it may provide a starting point for what is to be produced.

eladiw commented 2 years ago

design is done