[Discussion]: Free text survey question 5.a. Any other comments on network connectivity?

craddm commented 1 year ago

Summary

A summary of the free-text (non-categorical) responses to the above question

Source

No response

Detail

Respondents agree that network connectivity should be tightly controlled to avoid failures of information governance, such as leakage of sensitive data from a TRE or even between workspaces within a TRE. But configurability is key. The needs of different projects vary widely, and while some may operate well without requiring network connectivity either outside or within the TRE, others may not. And as noted by one respondent, the relationship between network connectivity and contractual relationships between the data owners and those who require access to the data can be complex. Thus, the precise configuration of network connectivity may need to be decided on a case-by-case basis, which implies that a TRE should be configurable to suit the circumstances.

Network connectivity to resources outside the TRE should be restricted by default. However, there are many cases in which access to external resources may be useful or even essential. Access to external software repositories such as CRAN/PyPI is typically perceived as desirable, if not essential. Connectivity to allow import of project-specific code or data into a workspace may also be desirable. However, these needs may be met by an ingress procedure that does not require direct connectivity to the external resource from within the TRE (e.g. an airlock procedure mediated by TRE administrators). In general, it is desirable that a mechanism exists by which access to external resources can be provided on a project-by-project basis, subject to information governance policy.
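To make the "restricted by default, opened per project" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `ProjectNetworkPolicy` class, the `outbound_allowed` helper, and the project names are hypothetical and do not come from the specification or any existing TRE implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectNetworkPolicy:
    """Hypothetical per-project outbound policy (illustrative only)."""

    project: str
    # Hosts approved for this project under information governance policy.
    allowed_hosts: set[str] = field(default_factory=set)

    def permits(self, host: str) -> bool:
        # Default deny: only explicitly approved hosts pass.
        return host.lower() in self.allowed_hosts


# Example policies: project-a may reach package repositories,
# project-b has no external connectivity at all.
policies = {
    "project-a": ProjectNetworkPolicy("project-a", {"cran.r-project.org", "pypi.org"}),
    "project-b": ProjectNetworkPolicy("project-b"),
}


def outbound_allowed(project: str, host: str) -> bool:
    """Return True only if the project has a policy that approves the host."""
    policy = policies.get(project)
    return policy is not None and policy.permits(host)
```

A project with no policy at all is treated the same as one with an empty allowlist, so forgetting to configure a project fails closed rather than open.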

Some respondents express the position that external resources should also be TRE-like, or at least known and trusted. Again, this may vary on a case-by-case basis. Others note that external connectivity would be required for access to federated analytics services, which in principle should protect data privacy while allowing access to advanced computational resources.

Network connectivity within a TRE should be considered to enable collaboration, which several respondents considered to be essential. Again, this should be tightly controlled so as not to allow data leakage between workspaces. Thus, for example, users within a specific workspace should be able to collaborate and share code or data between themselves, but should not be able to link that data to datasets from outside that workspace. Projects should typically be isolated from one another.
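As a rough illustration of the isolation rule described above, the sketch below allows a user to reach a shared resource only if both belong to the same workspace. The workspace names, users, resources, and the `can_access` helper are all hypothetical, chosen only to show the shape of the check.

```python
# Hypothetical membership data: which users belong to which workspace.
workspace_members = {
    "ws-1": {"alice", "bob"},
    "ws-2": {"carol"},
}

# Hypothetical shared resources, each owned by exactly one workspace.
resource_workspace = {
    "shared-db": "ws-1",
    "web-app": "ws-2",
}


def can_access(user: str, resource: str) -> bool:
    """Allow access only when the user is a member of the resource's workspace."""
    workspace = resource_workspace.get(resource)
    if workspace is None:
        # Unknown resources are denied rather than assumed safe.
        return False
    return user in workspace_members.get(workspace, set())
```

Under these assumptions, `alice` and `bob` can share `shared-db` between themselves, but neither can reach `web-app`, which belongs to a different workspace.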

It appears that the existing specification already covers most concerns of the respondents. The provisions of section 2.3 Network Management address isolation of workspaces through limitation of outbound connections and disallowing connectivity between users on different projects or with access to different datasets.

Section 2.1.2 Software tools covers the ability to access external resources such as CRAN/PyPI, and enabling collaboration within workspaces through shared tools such as databases and web apps accessible only within a given workspace. Notably, the section does not require such shared tools to be hosted within the TRE, only that they are shared exclusively with users of a specific project. This does open the possibility of external resources being shared privately between project users.

Section 3.1. Data lifecycle management covers concerns about ingress and egress of data, in that it requires a TRE to have a process of ingress/egress that ensures all information governance policies are adhered to.

Some possible points of discussion:

The specification does not currently directly address code/software ingress, unless I've missed it. Thus, we may need to add a statement that covers connecting a mechanism for code/software ingress to info gov policy, comparable to the current statements for data.
We could consider restricting shared collaborative tools to those hosted within a TRE, but leaving the door slightly ajar allows for cases where users or data owners want externally hosted tools to be usable.

Intended Output

No response

Who can help

@sa-tre/spec-maintainers

edwardchalstrey1 commented 1 year ago

The specification does not currently directly address code/software ingress, unless I've missed it. Thus, we may need to add a statement that covers connecting a mechanism for code/software ingress to info gov policy, comparable to the current statements for data.

If we're going to add a statement on software ingress, it shouldn't go in 3.1. Data lifecycle management. We should also be careful to word it so that it is inclusive of different TRE types. In DSH, I think we have a software ingress process which is the same as for data, whereas in OpenSAFELY code ingress happens every time an analysis is run.

I suggest:

We add to 2.1.2 Software tools a statement similar to one that's in Data lifecycle management along the lines of "You must have an ingress process for software tools which enforces information governance rules/processes." - this would cover situations like installing a new package that wasn't in the TRE when it was set up
Separately (I'm not sure under which pillar or capability), we should include a statement on "Code ingress", which is a distinct thing from the ingress of software tools. This covers the ingress of actively developed analysis code during a project

edwardchalstrey1 commented 1 year ago

@sa-tre/spec-maintainers thoughts on this? ^

manics commented 1 year ago

I think it'd be worth stating what we see as the difference between data, software and code before creating any statements, since it's not always clear cut. Are there some general principles we can come up with for all ingress/egress? For example:

Ingress of actively developed code may require some sort of automated streaming ingress, but we may also want to support streaming or live data feeds. What are the similarities and differences between how we treat these two?
Ingress of AI models: is this software or data? Does it make a difference if the model is bundled in the software package? How big does a model (or other set of parameters) have to be before it crosses the line from a software configuration file to a data file?

manics commented 1 year ago

Another thought: streaming/automated "egress". For example, a restricted, limited dashboard may be judged acceptable to automatically show a live output. In https://github.com/sa-tre/satre-specification/pull/132 we discussed live reporting of data use, but this is also a form of egress.
