Closed marrobi closed 2 years ago
Given the overlap with #33, I suggest handling #33 first.
I have been giving some thought to the process of getting data into and out of the TRE. As a minimum, we could follow a simple process, as outlined in the diagram below, and utilise the event management process from within the storage areas. Files can be copied using Storage Explorer, and the file creation and deletion events can be audited and notified automatically. The same process can be reversed for getting data out of the TRE. In essence, no data can leave the TRE unless the PI physically copies it, and that copy will be audited, so the process for getting data in is the same as for getting it out.
I do not believe it is possible to virus scan once inside the TRE; that would have to be a function outside it. Happy to be corrected on that one if there is a way.
Walking through the process and taking the requirements from 11 Jan into account we can do away with the staging area and simplify the process further.
1. *Researcher uploads data via specific workspace in the TRE web portal.* No, only the PI should be able to copy agreed data into (or out of) the Workspace. Only the PI has the authority, so they should be the ones with the security access to do that. This creates a real-world airgap in the form of the PI.
2. *Data is scanned for viruses (and potentially PII).* I believe this can only happen outside of the TRE, and is therefore done through standard virus-checking processes.
3. *Data import approver receives notification. They log onto a secure VM and can view the file, and the results of scans.* In the example below, the approver is the PI and is the one controlling the data movement; it is the researcher who is notified when the PI copies the data into the workspace.
4. *Once approved, data gets moved to user storage in the workspace, and the researcher is notified.* This step is not required because it is the PI who controls the data movement process, not the researcher, and it is accommodated in step 3.
This process solves the following:
The "airgap" is created by default and is a function of the security credentials of the PI. Unless the PI actively does something data does not move.
We've gone through a set of existing requirements, guidance from HDRUK, and the current inputs on GH airlock import/export. Without going into implementation details, here is a suggestion for the user stories of airlock import. The main goal is to fine-tune the user stories and finally add them to GH Issues.
Considerations:
Main Goals:
- Import certain input data: data files, ML models, SQL databases, CSVs, code snippets
- Prevent malware inside the TRE and/or TRE workspaces
- Automated process with approval built in
Envisioned high-level Workflow
Once approved, data moves to the Workspace Storage, and the requester gets notified.

```mermaid
graph TD
    B(External Storage) -->|Import Request created| C(fa:fa-spinner Protected Storage)
    C --> D{Gates - Malware Scanning}
    D --> |Clean data| E[Internal Storage]
    D --> |Threats Found| F[Quarantine Storage]
    E --> G{Approval?}
    G --> |Approved| H[Workspace Storage]
    G --> |Rejected| I[Rejected Storage]
```
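The flow in the diagram can be sketched as a simple decision function. This is an illustrative sketch only: the storage names come from the diagram, but the function and its parameters are hypothetical, not the TRE implementation.

```python
# Illustrative sketch of the import flow in the diagram above.
# Function name and parameters are hypothetical, not the TRE implementation.

def run_import(data_is_clean: bool, approved: bool) -> str:
    """Walk one import request through the airlock stages; return the final storage."""
    location = "External Storage"    # user uploads data here
    location = "Protected Storage"   # import request created: data moves, user loses access
    # Gate: malware scanning
    if not data_is_clean:
        return "Quarantine Storage"  # threats found: airlock stops
    location = "Internal Storage"    # clean data awaits manual review
    # Gate: Data Steward approval
    return "Workspace Storage" if approved else "Rejected Storage"

print(run_import(data_is_clean=True, approved=True))  # Workspace Storage
```

The two `if` branches correspond to the two decision nodes (malware scanning gate, approval gate) in the diagram.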
Draft User stories:
As a TRE User/Researcher/Workspace Owner/Workspace Data Steward
I want to be able to upload a data file (zip) to the External Storage
So that I make the data available to start the airlock import process
As a TRE User/Researcher/Workspace Owner/Workspace Data Steward
I want to be able to invoke the airlock import process
So that the data is available to start the airlock import process
(Invoking the process moves the data from the External Storage to the Protected Storage.)
As an automated process in TRE
I want to execute the gates defined in the import process by the TRE
So that I guarantee the safety of the data
As an automated process in TRE
I want to execute a virus scanning gate on the data in the Protected Storage
So that no infected data is imported to the workspace
As an automated process in TRE
I want data in which threats are found to be moved to the Quarantine Storage
So that infected data is kept in a specific location outside the TRE
As an automated process in TRE
I want clean data to be moved to the Internal Storage
So that the import process advances phase and gets ready for manual data review by Data Steward
Internal Storage

The Internal Storage must be read-only.

As a TRE Workspace Data Steward
I want to be able to see an airlock import process status by process id
So that I can validate the current status and the results of all the gates in the airlock process
As a TRE Workspace Data Steward
I want to be able to update the airlock import approval state
So that the airlock import process terminates and the data referenced in the request gets deleted
On approval, data moves from the Internal Storage to the Workspace Storage; on rejection, it moves from the Internal Storage to the Rejected Storage. The data is then deleted from the Internal Storage.
As a TRE User/Researcher
I want to be able to access the data in the Workspace Storage
So that I can copy/move data into the workspace shared storage
The Workspace Storage is only available to the requesting TRE User/Researcher. The Workspace Storage is retained for auditing purposes.

Please see my comment on the export issue: https://github.com/microsoft/AzureTRE/issues/33#issuecomment-1068017674
I'd like to start a discussion on the suggested storage names for the airlock mechanism.
Import
Suggested flow is:
External Storage -> import request created -> (data moves to) Protected storage -> data scanned -> (if clean, data moves to) Internal storage, (if threat found, data moves to) Quarantine storage -> (if request approved, data moves to) Workspace storage, (if request rejected, data is deleted)
- External storage: a storage account with internet access; items on it can be anything (clean or infected); unless a request to import is made, nothing happens
- Protected storage: access is granted only to the airlock mechanism and items can't be modified; once an item is uploaded, it triggers a scan
- Internal storage: access granted only to the airlock mechanism; files on this storage were scanned and found clean
- Quarantine storage: access granted only to the airlock mechanism; files were scanned and identified as infected
- Workspace storage: access granted to the workspace user/researcher; files were scanned and found clean
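The definitions above can be summarised as a small access matrix. This is a sketch only: the field names are illustrative, not a real configuration schema.

```python
# Sketch of the storage definitions above as an access matrix.
# Keys and field names summarise the prose; they are illustrative, not real config.
STORAGE = {
    "External":   {"internet": True,  "access": "any user",         "note": "nothing happens until a request"},
    "Protected":  {"internet": False, "access": "airlock only",     "note": "immutable; upload triggers a scan"},
    "Internal":   {"internet": False, "access": "airlock only",     "note": "scanned and found clean"},
    "Quarantine": {"internet": False, "access": "airlock only",     "note": "scanned and found infected"},
    "Workspace":  {"internet": False, "access": "workspace user",   "note": "scanned and found clean"},
}

# Only the External storage should be reachable from the internet:
public = [name for name, cfg in STORAGE.items() if cfg["internet"]]
print(public)  # ['External']
```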
Export
Suggested flow is :
Workspace storage -> (data moves to) Internal storage -> (if request approved data moves to) External storage, (if request rejected) data deleted
definitions as above
I think that's an excellent idea. We should think carefully about what we do with an import that is rejected. It could be that we would prefer simply to delete the data/package in question, or it could be that we would wish to keep a copy of all content, approved or not, as part of our record of transactions, in case of a subsequent dispute. Either way, the package or the copy needs to be deleted or stored safely and separately, as it is not ours (the TRE's) to use or vouch for. We might think of there being five kinds of storage/access configurations/stages of the process, in addition to any separate archiving, with one of them not being our storage at all.
Virus scanning, or any checking that requires internet access, could take place in the request storage - if we want to insist that the review storage is seen as part of the TRE in that sense.
The outgoing/export process is simply this in reverse.
Of course, it all depends upon how you want to do the 'locking' of the request data for review. Here, in a perhaps-naive attempt at simplification, I've seen this as happening when the TRE admin / steward copies the data into the airlock - into the review storage - for review. The user may then update the request storage, but that isn't going to change what is reviewed. It's the snapshot that matters. If we decide that we need to do scanning outside the review environment, then you need to lock the request environment to support this.
So looking at the suggested terms, 'external' is request, 'protected' is 'review', 'internal' is 'ready', and 'workspace' is 'inside'. And I am suggesting that 'quarantine' is more complicated.
@jimdavies that's useful input.
I think what is slightly different is that we are thinking about a semi-automated process, an "airlock service", with built-in approvals that moves the data, rather than users doing the data moves.
Terminology around the Import Airlock Process
The storage terminology there was: External, Protected, Internal, Quarantine, and Workspace storage.
I think, as does @jimdavies, that a "Rejected storage" is missing, so that users can work out why their data was rejected by the Airlock. Perhaps it could also contain the reason for rejection? (Perhaps they uploaded the wrong files; seeing it with their own eyes may help them solve the issue.)
Import Airlock process: Storage or States?
I wonder whether there is really a difference between Protected / Internal / Quarantine / Rejected / Workspace storage, or is it the same data in different stages of processing?
Alternative Container with flagged state flow
Assuming that we copy files from the External Storage into a bespoke container for this process, flagged as a `Request` container, this storage container `SC` can then move along the Airlock process:

1. Scan `SC` for viruses. If it contains viruses, change `SC` status from `Request` to `Quarantine`; the Airlock process stops and the virus notification process starts. Otherwise, change `SC` status from `Request` to `Review`.
2. Review the `SC` data. The `SC` container in the `Review` status can be reviewed, then either moved to the `Rejected` status, with notification of this sent back to the user, or moved into the `Workspace` status and attached to the TRE Workspace. Reviewing the data can be manual, or supported by tools and automated logic.
Advantage: this method has the benefit of one area of storage moving through states, rather than five areas of storage and copying.
Summary: This provides a container, then reviewing it before attaching it.
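The single-container alternative can be sketched as a state machine with a fixed set of allowed transitions. The state names follow the comment above; the validation mechanism itself is illustrative only.

```python
# Sketch of the one-container, flagged-state flow described above.
# State names follow the comment; the validation code is illustrative only.
ALLOWED = {
    "Request":    {"Quarantine", "Review"},   # virus scan outcome
    "Review":     {"Rejected", "Workspace"},  # review outcome
    "Quarantine": set(),                      # terminal: airlock stops, notify
    "Rejected":   set(),                      # terminal: notify the user
    "Workspace":  set(),                      # terminal: attached to the TRE Workspace
}

def transition(state: str, new_state: str) -> str:
    """Move the container to a new state, rejecting illegal transitions."""
    if new_state not in ALLOWED[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

state = transition("Request", "Review")
state = transition(state, "Workspace")
print(state)  # Workspace
```

Because the container only ever changes state, there is no copy step to audit between five storages; the audit trail is the sequence of state changes.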
This process could easily be adapted for other types of data system we might want to attach, but we won't go into that here.
Definition of External Storage
The definition of External Storage may need to be refined.
Draft user stories around the External Storage Area
Permissions in External Storage
As a TRE User/Researcher/Workspace Owner/Workspace Data Steward, having uploaded a file to the External Storage Area to go through the Import Airlock Process into a particular TRE Workspace: whilst in the External Storage Area, that file should only be accessible to the users operating the Import Airlock for that TRE Workspace. Other users should not have access to the file.
Similarly ...
As a TRE User/Researcher/Workspace Owner/Workspace Data Steward, having moved a file into the External Storage Area from a TRE Workspace through an Export Airlock Process: that file should only be accessible to the users from the TRE Workspace. Other users should not have access to the file.
Institutional Policy for External Storage
As an institution running a TRE: We want to control where External Storage areas can transfer data to and from.
What's confusing me is that I have been seeing it in terms of services, rather than storage - but of course there is storage involved. The user needs to supply the package associated with an import request, and that can be done by placing it in a specific storage area and/or by calling a service that takes it off their hands, so to speak. Once they've done that, they don't need to do anything else until the data has been reviewed - upon which it is made available for them to take a copy and/or call a service that delivers the data into their workspace. Seeing all of this from a storage perspective makes me think of permissions for r/w rather than persistence of data for services, but it adds up to the same thing I think.
Ah, now Charlie's come in from the storage perspective.
Export Airlock Flow using Storage Containers and States
Given some data in a TRE Workspace which needs to be exported:
1. Data is copied into a `Request` container.
2. The `Request` container is detached from the TRE Workspace. If a scan finds threats, its status is set to `Quarantine`, else it is set to `Review`.
3. In the `Review` status, the Data Storage Container is reviewed, then set to `External`.
4. In the `External` status, the container contents are copied to the External Data Store and the container is destroyed.

There is a lot more that needs to be defined in terms of what metadata is needed on the containers, but this is a bare-bones description of the process.
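The export steps above can be sketched in the same style as the import flow. The container states are those named above; the function, its parameters, and the `Rejected` outcome for a failed review are illustrative assumptions.

```python
# Sketch of the export container flow described above.
# States come from the description; parameters and the Rejected outcome are assumptions.

def run_export(scan_clean: bool, review_passed: bool) -> str:
    """Walk one export container through its states; return the final state."""
    state = "Request"   # data copied in, container detached from the Workspace
    if not scan_clean:
        return "Quarantine"  # threats found: export stops
    state = "Review"    # awaiting review
    if not review_passed:
        return "Rejected"
    # Contents copied to the External Data Store; the container is then destroyed.
    return "External"

print(run_export(scan_clean=True, review_passed=True))  # External
```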
@charlescrichton the challenge around just changing blob status is that endpoints and hence public/private connectivity is defined at the storage account level, not the container level.
If we don't use distinct accounts it opens up a risk that a user can extract data externally using credentials they are using internally.
Hence I believe there is the need for different storage locations - some with a public endpoint, some with just a private endpoint to the workspace, some private endpoint accessed solely by the "service".
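This point can be made concrete with three distinct (entirely hypothetical) storage accounts, each with a different network posture. This is a descriptive sketch, not Azure resource code; the account names and field names are invented for illustration.

```python
# Sketch of why distinct storage accounts are needed: public/private network
# access is configured per storage account, not per container.
# Account names and fields are hypothetical, for illustration only.
ACCOUNTS = {
    "stairlockexternal": {"public_endpoint": True,  "private_endpoint_to": None},
    "stairlockinternal": {"public_endpoint": False, "private_endpoint_to": "airlock service"},
    "stworkspace":       {"public_endpoint": False, "private_endpoint_to": "workspace"},
}

# Credentials usable inside the workspace must never also reach a public
# endpoint, or a user could exfiltrate data with internal credentials:
leak_risk = [name for name, cfg in ACCOUNTS.items()
             if cfg["public_endpoint"] and cfg["private_endpoint_to"] == "workspace"]
print(leak_risk)  # []
```

With one account per stage, revoking or granting a network path is an account-level operation, which is exactly the isolation the container-status approach cannot provide on its own.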
Okay. Rather than talking about read and write access...
In terms of who initiates the transitions, or who calls a service:
Good inputs. I think we should separate between the technical details such as how we move the data to the logical process.
Logically, what I imagine is multiple containers/storage accounts for the different stages. Data MOVES across those locations. The researcher can write the data to the external storage; when the import request is made, that data MOVES to an 'advanced' location where no user can ever edit it again, until it is either in the rejected storage or the workspace storage. If the researcher wants to update the data, this is not possible; the user can create a new import request. When I say multiple locations, I mean it might be one or more: maybe we decide to have a protected storage as I suggested for scanning, maybe we even break it into more storage containers in the future...
The same applies to the export.
Permissions-wise, even the admin shouldn't have permission to update data while it is in progress: read-only permissions only. Data movement is done by the automated process, with no user intervention (other than clicking 'approve/reject').
I agree that adding a rejected storage + adding a deleted state to a request is needed
Based on recent inputs, and targeting an initial version for the airlock import, I've updated the draft user stories and diagram. These represent logical storage and implementation may differ.
design is done
Organizations wish to have control of data that is imported into a workspace to prevent malicious software being installed, and datasets that allow relinking and hence identification of individuals.
A high level ingress workflow may be: