microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

Airlock - workspace data import (Design) #1109

Closed: marrobi closed this 2 years ago

marrobi commented 2 years ago

Organizations wish to have control over the data that is imported into a workspace, both to prevent malicious software from being installed and to block datasets that allow re-linking, and hence identification, of individuals.

A high level ingress workflow may be (see the sketch after this list):

  1. Researcher uploads data via a specific workspace in the TRE web portal.
  2. Data is scanned for viruses (and potentially PII).
  3. The data import approver receives a notification. They log onto a secure VM where they can view the file and the results of the scans.
  4. Once approved, the data is moved to user storage in the workspace and the researcher is notified.
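A minimal sketch of those four steps as a single orchestration function. Every helper below is an illustrative stand-in, not an AzureTRE API:

```python
def scan_for_viruses(path: str) -> bool:
    # Stand-in for step 2: a real gate would call a malware-scanning service.
    return True

def notify(recipient: str, message: str) -> None:
    # Stand-in: a real implementation would send an email or portal notification.
    print(f"to {recipient}: {message}")

def approved_by(approver: str, path: str) -> bool:
    # Stand-in for step 3: the manual review performed on a secure VM.
    return True

def move_to_workspace_storage(path: str, workspace_id: str) -> None:
    # Stand-in for step 4: copy the data into the workspace's user storage.
    print(f"moved {path} into workspace {workspace_id}")

def ingress(path: str, workspace_id: str, approver: str, researcher: str) -> None:
    if not scan_for_viruses(path):                       # step 2
        notify(approver, f"import blocked by scan: {path}")
        return
    notify(approver, f"import awaiting review: {path}")  # step 3
    if approved_by(approver, path):
        move_to_workspace_storage(path, workspace_id)    # step 4
        notify(researcher, f"import approved: {path}")
```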
marrobi commented 2 years ago

Given overlap with #33 suggest handle #33 first.

CalMac-tns commented 2 years ago

I have been giving some thought to the process of getting data into and out of the TRE. As a minimum we could potentially follow a simple process, as outlined in the diagram below, and utilise the event management process within the storage areas. Files can be copied using Storage Explorer, and the file creation and deletion events (and notifications) are audited automatically. The same process can be reversed for getting data out of the TRE; in essence no data can leave the TRE unless the PI physically copies it, which will be audited, so the process for getting data in is the same as getting it out.
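A minimal sketch, assuming an Azure Function with an Event Grid trigger subscribed to the storage account, of how the blob creation/deletion events mentioned above could be captured for audit. The audit sink is left as a placeholder:

```python
import json
import logging

import azure.functions as func


def main(event: func.EventGridEvent) -> None:
    # event_type is e.g. "Microsoft.Storage.BlobCreated" or
    # "Microsoft.Storage.BlobDeleted"; subject identifies the blob path.
    record = {
        "id": event.id,
        "event_type": event.event_type,
        "subject": event.subject,
        "data": event.get_json(),
    }
    # Placeholder: a real deployment would write this to an audit store.
    logging.info("storage audit event: %s", json.dumps(record))
```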

I do not believe it is possible to virus scan once inside the TRE; that would have to be a function outside. Happy to be corrected on that one if there is a way.

*(diagram)*

CalMac-tns commented 2 years ago

Walking through the process, and taking the requirements from 11 Jan into account, we can do away with the staging area and simplify the process further.

1. *Researcher uploads data via specific workspace in the TRE web portal.* No: only the PI should be able to copy agreed data into (or out of) the workspace. Only the PI has authority, so they should be the ones with the security access to do that. This creates a real-world airgap in the form of the PI.
2. *Data is scanned for viruses (and potentially PII).* I believe this can only happen outside of the TRE, and is therefore done through standard virus-checking processes.
3. *Data import approver receives notification. They log onto a secure VM and can view file, and results of scans.* In the example below the approver is the PI, who is the one controlling the data movement; it is the researcher that is notified when the PI copies the data into the workspace.
4. *Once approved data gets moved to user storage in the workspace, and researcher notified.* This step is not required because it is the PI who controls the data movement process, not the researcher, and it is accommodated in step 3.

This process solves the following:

  1. Researchers cannot copy data in or out of the TRE.
  2. The PI is responsible for all data movements.
  3. Data is only copied into the TRE once approved - it's the PI's responsibility.
  4. All files that are copied into, created in, or deleted from the VM storage are automatically logged.
  5. Any size and any number of files can be dealt with using Storage Explorer.

The "airgap" is created by default and is a function of the security credentials of the PI. Unless the PI actively does something data does not move.

*(diagram)*

joalmeid commented 2 years ago

We've gone through a set of existing requirements, guidance from HDRUK, and the current inputs on GH airlock import/export. Without looking into implementation details, here is a suggestion for the user stories of airlock import. The main goal is to fine-tune the user stories and finally add them to GH Issues.

Considerations:

Main Goals:

- Import certain input data: data files, ML models, SQL databases, CSVs, code snippets
- Prevent malware inside the TRE and/or TRE workspaces
- Automated, with an approval process built in

Envisioned high-level workflow (a gate-pipeline sketch follows the diagram)

  1. TRE User/Researcher/Workspace Data Steward/TRE Owner ingests data to a specific workspace in the TRE.
  2. Data goes through a series of gates before manual approval. The current gate is malware scanning; others may come in the future (e.g. PII scanning).
  3. The Workspace Data Steward receives a notification for approval. They log onto a secure VM and can view the file and the gate results.
  4. Approved data gets moved to the Workspace Storage, and the requester is notified.
```mermaid
graph TD
    B(External Storage) -->|Import Request created| C(fa:fa-spinner Protected Storage)
    C --> D{Gates - Malware Scanning}
    D --> |Clean data| E[Internal Storage]
    D --> |Threats Found| F[Quarantine Storage]
    E --> G{Approval?}
    G --> |Approved| H[Workspace Storage]
    G --> |Rejected| I[Rejected Storage]
```
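To make the gates concrete, here is a hedged sketch of a pluggable gate pipeline deciding the next storage stage; the `GateResult` type and gate names are illustrative assumptions, not the AzureTRE implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class GateResult:
    gate: str
    passed: bool
    detail: str = ""

def malware_scan(blob_url: str) -> GateResult:
    # Stand-in: a real gate would invoke a scanning service against the blob.
    return GateResult(gate="malware-scan", passed=True)

# Future gates (e.g. PII scanning) are simply appended to this list.
GATES: List[Callable[[str], GateResult]] = [malware_scan]

def run_gates(blob_url: str) -> str:
    """Return the next stage: Internal Storage if all gates pass, else Quarantine."""
    results = [gate(blob_url) for gate in GATES]
    return "internal" if all(r.passed for r in results) else "quarantine"
```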

Draft user stories (a status-model sketch follows the stories):

As a TRE User/Researcher/Workspace Owner/Workspace Data Steward, I want to be able to upload a data file (zip) to the External Storage, so that I make the data available to start the airlock import process.

As a TRE User/Researcher/Workspace Owner/Workspace Data Steward, I want to be able to invoke the airlock import process, so that the airlock import process starts for the uploaded data.

As an automated process in the TRE, I want to execute the gates defined in the import process by the TRE, so that I guarantee the safety of the data.

As an automated process in the TRE, I want to execute a virus-scanning gate on the data in the Protected Storage, so that no infected data is imported to the workspace.

As an automated process in the TRE, I want data with threats found to be moved to the Quarantine Storage, so that infected data is kept in a specific location outside the TRE.

As an automated process in the TRE, I want cleaned data to be moved to the Internal Storage, so that the import process advances a phase and becomes ready for manual data review by the Data Steward.

As a TRE Workspace Data Steward, I want to be able to see an airlock import process status by process id, so that I can validate the current status and the results of all the gates in the airlock process.

As a TRE Workspace Data Steward, I want to be able to update the airlock import approval state, so that the airlock import process terminates and the data referenced in the request gets deleted.

As a TRE User/Researcher, I want to be able to access the data in the Workspace Storage, so that I can copy/move data into the workspace shared storage.
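Purely as an illustration (names are assumptions, not the AzureTRE API), the stories above imply a small request record that the Data Steward can query by process id:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ImportStatus(str, Enum):
    SUBMITTED = "submitted"
    IN_GATES = "in_gates"
    QUARANTINED = "quarantined"
    AWAITING_REVIEW = "awaiting_review"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ImportRequest:
    process_id: str
    workspace_id: str
    status: ImportStatus = ImportStatus.SUBMITTED
    gate_results: List[str] = field(default_factory=list)


# In-memory stand-in for the store behind "see status by process id".
REQUESTS: Dict[str, ImportRequest] = {}

def get_status(process_id: str) -> ImportRequest:
    return REQUESTS[process_id]
```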

SvenAelterman commented 2 years ago

Please see my comment on the export issue: https://github.com/microsoft/AzureTRE/issues/33#issuecomment-1068017674

eladiw commented 2 years ago

I'd like to start a discussion on the suggested storage names for the airlock mechanism.

Import

Suggested flow is:

External Storage -> import request created -> (data moves to) Protected Storage -> data scanned -> (if clean, data moves to) Internal Storage; (if threat found, data moves to) Quarantine Storage -> (if request approved, data moves to) Workspace Storage; (if request rejected, data is deleted)

Definitions (summarised in a sketch after this list):

- External storage: a storage account with Internet access. Items on it can be anything (clean or infected); unless a request to import is made, nothing happens.
- Protected storage: access is granted only to the airlock mechanism and items can't be modified; once an item is uploaded, it triggers a scan.
- Internal storage: access granted only to the airlock mechanism; files on this storage were scanned and found clean.
- Quarantine storage: access granted only to the airlock mechanism; files were scanned and identified as infected.
- Workspace storage: access granted to the workspace user/researcher; files were scanned and found clean.
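Purely as an illustration of the access rules just listed (not the AzureTRE implementation), the zones can be summarised in a small structure:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StorageZone:
    name: str
    internet_facing: bool       # reachable from outside the TRE?
    airlock_service_only: bool  # only the airlock mechanism may modify it

ZONES = [
    StorageZone("external",   internet_facing=True,  airlock_service_only=False),
    StorageZone("protected",  internet_facing=False, airlock_service_only=True),
    StorageZone("internal",   internet_facing=False, airlock_service_only=True),
    StorageZone("quarantine", internet_facing=False, airlock_service_only=True),
    StorageZone("workspace",  internet_facing=False, airlock_service_only=False),
]
```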

Export

Suggested flow is :

Workspace storage -> (data moves to) Internal storage -> (if request approved, data moves to) External storage; (if request rejected) data deleted

definitions as above

jimdavies commented 2 years ago

I think that's an excellent idea. We should think carefully about what we do with an import that is rejected. It could be that we would prefer to simply delete the data/package in question, or it could be that we would wish to keep a copy of all content, approved or not, as part of our record of transactions, in case of a subsequent dispute. Either way, the package or the copy needs to be deleted or stored safely and separately, as it is not ours (the TRE's) to use or vouch for. We might think of there being five kinds of storage/access configurations/stages of the process, in addition to any separate archiving, with one of them not being our storage at all.

*(diagram: TRE airlock)*

Virus scanning, or any checking that requires internet access, could take place in the request storage - if we want to insist that the review storage is seen as part of the TRE in that sense.

The outgoing/export process is simply this in reverse.

jimdavies commented 2 years ago

Of course, it all depends upon how you want to do the 'locking' of the request data for review. Here, in a perhaps-naive attempt at simplification, I've seen this as happening when the TRE admin / steward copies the data into the airlock - into the review storage - for review. The user may then update the request storage, but that isn't going to change what is reviewed. It's the snapshot that matters. If we decide that we need to do scanning outside the review environment, then you need to lock the request environment to support this.

So looking at the suggested terms: 'external' is 'request', 'protected' is 'review', 'internal' is 'ready', and 'workspace' is 'inside'. And I am suggesting that 'quarantine' is more complicated.

marrobi commented 2 years ago

@jimdavies that's useful input.

I think what is slightly different is that we are thinking about a semi-automated process, an "airlock service" with approvals that moves the data, rather than users doing the data moves.

charlescrichton commented 2 years ago

Terminology around the Import Airlock Process

The storage terminology there was: External / Protected / Internal / Quarantine / Workspace storage.

I think, as does @jimdavies, that "Rejected storage" is missing, so that users can work out why their data was rejected by the airlock. Perhaps it should also contain the reason for rejection? (Perhaps they uploaded the wrong files; seeing them with their own eyes may help them solve the issue.)

Import Airlock process: Storage or States?

I wonder whether there is really a difference between Protected / Internal / Quarantine / Rejected / Workspace storage, or is it the same data in different stages of processing?

Alternative Container with flagged state flow

Assume that we copy files from the External Storage into a bespoke container for this process, flagged as a Request container.

This storage container (SC) can then move along the airlock process...

  1. Check the SC for viruses. If it contains viruses, change the SC status from Request to Quarantine; the airlock process stops and the virus notification process starts. Otherwise, change the SC status from Request to Review.
  2. Review the SC data. The SC in the Review status can be reviewed, then either moved to the Rejected status, with notification of this sent back to the user, or moved into the Workspace status and attached to the TRE Workspace.

Reviewing the data can be manual, or supported by tools and automated logic.

Advantage: This method has the benefit of one area of storage moving through states, rather than five areas of storage and copying.

Summary: this provides a container, reviews it, and only then attaches it.
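A minimal sketch, assuming `azure-storage-blob`, of advancing a single container through the flagged states described above; the metadata key and state names are illustrative assumptions:

```python
from azure.storage.blob import ContainerClient

# Legal transitions in the flagged-state flow described above.
ALLOWED = {
    "request": {"quarantine", "review"},
    "review": {"rejected", "workspace"},
}

def set_airlock_state(container: ContainerClient, new_state: str) -> None:
    """Advance the container's airlock state, enforcing legal transitions."""
    metadata = container.get_container_properties().metadata or {}
    current = metadata.get("airlock_state", "request")
    if new_state not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new_state}")
    container.set_container_metadata({"airlock_state": new_state})
```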

This process could easily be adapted for other types of data system we might want to attach, but we won't go into that here.

Definition of External Storage

The definition of External Storage may need to be refined:

Draft user stories around the External Storage Area

Permissions in External Storage

As a TRE User/Researcher/Workspace Owner/Workspace Data Steward: having uploaded a file to the External Storage Area to go through the Import Airlock Process into a particular TRE Workspace, whilst that file is in the External Storage Area it should only be accessible to the users operating the Import Airlock for that TRE Workspace. Other users should not have access to the file.

Similarly ...

As a TRE User/Researcher/Workspace Owner/Workspace Data Steward: having moved a file into the External Storage Area from a TRE Workspace through an Export Airlock Process, that file should only be accessible to the users from the TRE Workspace. Other users should not have access to the file.

Institutional Policy for External Storage

As an institution running a TRE: We want to control where External Storage areas can transfer data to and from.

jimdavies commented 2 years ago

What's confusing me is that I have been seeing it in terms of services, rather than storage - but of course there is storage involved. The user needs to supply the package associated with an import request, and that can be done by placing it in a specific storage area and/or by calling a service that takes it off their hands, so to speak. Once they've done that, they don't need to do anything else until the data has been reviewed - upon which it is made available for them to take a copy and/or call a service that delivers the data into their workspace. Seeing all of this from a storage perspective makes me think of permissions for r/w rather than persistence of data for services, but it adds up to the same thing I think.

jimdavies commented 2 years ago

Ah, now Charlie's come in from the storage perspective.

charlescrichton commented 2 years ago

Export Airlock Flow using Storage Containers and States

Given some data in a TRE Workspace which needs to be exported:

  1. Attach an "Outbox Data Container" to the TRE Workspace.
  2. The TRE User fills it with the data they wish to export and sets the status to Request.
  3. The Request container is detached from the TRE Workspace.
  4. Virus checks run: if a virus is found, the status is set to Quarantine, else to Review.
  5. The data storage container is reviewed in the Review status, then set to External.
  6. The External container's contents are copied to the External Data Store, and the container is destroyed.

There is a lot more that needs to be defined in terms of what metadata is needed on the containers, but this is a bare-bones description of the process.

marrobi commented 2 years ago

@charlescrichton the challenge around just changing blob status is that endpoints, and hence public/private connectivity, are defined at the storage account level, not the container level.

If we don't use distinct accounts it opens up a risk that a user can extract data externally using credentials they are using internally.

Hence I believe there is a need for different storage locations: some with a public endpoint, some with just a private endpoint to the workspace, and some with a private endpoint accessed solely by the "service".

jimdavies commented 2 years ago

*(diagram: TRE airlock)*

Okay. Rather than talking about read and write access...

In terms of who initiates the transitions, or who calls a service (see the sketch after this list):

  1. The user initiates the first, making a request; when this request is processed, the data in question is locked or snapshotted in "Request" storage.
  2. The admin initiates the second, presumably selecting from a task list in front of them, upon which the review can begin. While this could happen on the same storage, it may be better to think of the data being moved away from the place where it landed, logically, and into "Review" storage. It could be that some automatic checking, e.g. a virus scan, is done as soon as the data has been received.
  3. When the admin is satisfied, they initiate the third, in which the approved data is made available for collection. The approval is recorded, of course, perhaps with a copy of the data in case of any subsequent dispute.
  4. When the PI is ready, they can collect the data. In the earlier diagram, I was thinking of them being able to look at the data in the "Ready" storage before they copy it to their workspace.
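A compact sketch of the initiator rules above; the role and state names are assumptions made for the example, not AzureTRE terms:

```python
# (from_state, to_state) -> role allowed to initiate the transition
TRANSITIONS = {
    ("draft", "request"): "user",    # 1. user makes the request
    ("request", "review"): "admin",  # 2. admin starts the review
    ("review", "ready"): "admin",    # 3. admin approves; data is collectable
    ("ready", "workspace"): "pi",    # 4. PI collects into the workspace
}

def can_initiate(role: str, from_state: str, to_state: str) -> bool:
    """True if this role may trigger the given state transition."""
    return TRANSITIONS.get((from_state, to_state)) == role
```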
eladiw commented 2 years ago

Good inputs. I think we should separate the technical details, such as how we move the data, from the logical process.

Logically, what I imagine is multiple containers/storage accounts for the different stages. Data MOVES across those locations: the researcher can write the data to the external storage, and when the import request is made, that data MOVES to an 'advanced' location where no user can ever edit it again, until it is either in the rejected storage or the workspace storage. If the researcher wants to update the data, this is not possible; the user can create a new import request. When I say multiple locations, I mean that it might be one or more; maybe we decide to have a protected storage, as I suggested, for scanning, and maybe we even break it into more storage containers in the future...

The same goes for the export.

Permissions-wise, even the admin shouldn't have permission to update data while a request is in progress: read-only permissions. Data movement is done by the automated process, with no user intervention (other than clicking 'approve/reject').
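One way to honour that read-only rule, sketched with `azure-storage-blob` (the function and its parameters are illustrative, not AzureTRE code): users and reviewers only ever receive read/list SAS tokens, while all writes and moves are performed by the airlock service identity.

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

def read_only_review_sas(account_name: str, account_key: str, container: str) -> str:
    # Read + list only: a reviewer can inspect the snapshot but never modify it.
    return generate_container_sas(
        account_name=account_name,
        container_name=container,
        account_key=account_key,
        permission=ContainerSasPermissions(read=True, list=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=1),
    )
```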

eladiw commented 2 years ago

I agree that adding a rejected storage, plus adding a deleted state to a request, is needed.

joalmeid commented 2 years ago

Based on recent inputs, and targeting an initial version of the airlock import, I've updated the draft user stories and diagram. These represent logical storage; the implementation may differ.

eladiw commented 2 years ago

design is done