opensafely-core / codespaces-initiative

Improving the use of OpenSAFELY in Codespaces
MIT License

Understand how researchers may generate graphs outside of the secure environment #23

Closed lucyb closed 2 months ago

lucyb commented 4 months ago

We need to document this process, so that researchers do the right thing (and not commit outputs to git).

This will also help to determine the scope of the initiative: if researchers can't use released or published outputs without adding them to a git repo, they won't be able to use Codespaces later in the analysis pipeline (which tells us where one of the jumping-off points is).

In this issue we should investigate how a researcher may use released outputs in a "local" analysis step to generate graphs for publication:

Jongmassey commented 4 months ago

Chatting to @alschaffer earlier: she does various bits of analysis, as well as graph plotting, outside the secure environment.

Jongmassey commented 3 months ago

This is something I've thought about a lot, and I don't have a good answer about what the best approach is.

Considerations:

Jongmassey commented 3 months ago

With reproducibility in mind, we encourage researchers to commit all of their code to the study repository, and there is an argument for adding these post-release analysis/visualisation actions to the project.yaml. However, we do not particularly want the released-but-not-approved outputs in git (per the bad old days), as this makes it extremely challenging to un-release something.

Users could drag-and-drop released outputs into a released_outputs folder which is added to the .gitignore - would this be acceptable to IG? We already have little knowledge/control of where these files go when downloaded from Job Server
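A minimal sketch of the .gitignore idea above, assuming the folder is named released_outputs as suggested. The helper name `ensure_ignored` is hypothetical and is not part of opensafely-cli or any existing tooling.

```python
from pathlib import Path

# Folder name taken from the suggestion above; everything else is illustrative.
RELEASED_DIR = "released_outputs"


def ensure_ignored(repo_root, folder=RELEASED_DIR):
    """Create `folder` in the repo and make sure .gitignore excludes it.

    Returns True if .gitignore was modified, False if a matching
    entry was already present.
    """
    root = Path(repo_root)
    (root / folder).mkdir(parents=True, exist_ok=True)

    gitignore = root / ".gitignore"
    entry = f"/{folder}/"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []

    # Treat the common spellings of the same rule as equivalent.
    equivalents = {folder, f"{folder}/", f"/{folder}", entry}
    if any(line.strip() in equivalents for line in existing):
        return False

    existing.append(entry)
    gitignore.write_text("\n".join(existing) + "\n")
    return True
```

Note that a .gitignore entry only guards against accidental `git add`: it does not stop `git add -f`, and it has no effect on files that are already tracked.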

Jongmassey commented 3 months ago

Mounting the released files (WebDAV or similar?) from Job Server in the codespace automagically would be very shiny and cool.

Jongmassey commented 3 months ago

Do we want to allow released outputs to be used or should they be published?

Requiring publication is quite a high bar: the NHSE panel that approves published results only meets every two weeks, and there's quite a bit of faff involved. Having a consistent policy re: released-but-not-published outputs and a set of recommendations/best practices would be preferable.

sebbacon commented 3 months ago

Regarding the reproducibility point, which I think is very important - we've talked before about a kind of "local extension" idea for project.yaml/opensafely-cli

Something like:

All this obviously needs a lot more careful thought, but wanted to throw it into the ring. I think I recall @bloodearnest having some thoughts around it.

lucyb commented 3 months ago

@sebbacon these are good suggestions, but this ticket is solely for understanding when/how researchers should use Codespaces (this might not have been evident from the description). We realise that researchers may want to do some work outside a secure environment and want to know if we can/should support that with Codespaces or whether we should be pointing them to do the work on their local machine. In particular, we're concerned about the storage and processing of released (but not published) outputs.

I think Jon's comment here about how researchers could use Codespaces with released files is the most pertinent:

Users could drag-and-drop released outputs into a released_outputs folder which is added to the .gitignore - would this be acceptable to IG? We already have little knowledge/control of where these files go when downloaded from Job Server

Jongmassey commented 3 months ago

opensafely sync-outputs with an auth pop-up to Job Server feels like a nice approach.

bloodearnest commented 3 months ago

A few comments:

Broadly, when we've discussed this before, there were two approaches we considered:

1) submit graph generating jobs to run in a special non-sensitive backend

2) run graph generating jobs locally and release locally generated outputs to job-server

I think Option 1) is a mostly RAP-built solution, basically a whole new backend with some special job preparation and finalisation code.

FWIW: I am of the opinion we do not need the same level of reproducibility and audit trail for generating images from released data as we do for generating L4 data to be released.

lucyb commented 3 months ago

Some security considerations for the decision around allowing released outputs to be uploaded to a Codespace:

  1. Only the creator of a Codespace can access that Codespace directly (thread)
  2. It's possible to enable Live Share and publish the URL, giving someone full access to the Codespace.
  3. Alternatively, it's possible to enable a web service and make it either organisation-accessible or to have a publicly accessible forwarded port (however, this uses a random URL, so it would require that URL to be published somewhere). This could give others access to files if you're running, say, Jupyter in a Codespace.

This means that it requires active input from the creator of the Codespace in order for someone else to gain access to the files. In the case of port-forwarding, it's possible to configure that to start by default, but the URL would not be guessable and would require the creator to share it with others.

lucyb commented 3 months ago

From Seb in this thread:

The principle, then, seems to be:

  • It's OK for released outputs to be copied from the Job Server and shared with the immediate team
  • It's OK for them to be shared more widely (but not published) with close/senior scientific collaborators, subject to review within their own team (no formal approvals needed)
  • Only publication requires a formal signoff

However, we would want to make this very clear in the existing documentation, with some examples of acceptable movement of outputs, presumably by adding some words to that Policies for Researchers doc. For example, we might say it's OK to copy outputs elsewhere for further processing, as long as the results of that processing are subject to the same controls.

Actions:

StevenMaude commented 3 months ago

Some security considerations for the decision around allowing released outputs to be uploaded to a Codespace …

I read the following today. It might not be important, but I'll flag it because it's quirky and unexpected behaviour.

From GitHub's documentation:

Once a user loses access to a codespace, the codespace is retained for a period of 7 days, then it is permanently deleted. During this 7-day period, to recover unpublished work from the codespace, the user must contact us through the GitHub Support portal.

I'm not sure what this support process entails. Can the removed user circumvent the organization in recovering data from an organization-owned codespace? You'd hope not, but I don't know.

I suspect, but haven't tested, that the hosting organization can still delete the codespace (because they'll presumably be incurring storage costs for it).

lucyb commented 2 months ago

I presented the question about the use of released outputs in a Codespace to Amir at an IG clinic on 1/5/24. He concluded that it is fine to allow researchers to use released outputs in a Codespace, with the following caveats around sharing:

As long as the codespace tooling meets the same sharing aspects of a webinar or email where results are shared in confidence, then it can be used as another method for sharing results in confidence, or equally a place where now anonymous results (because they have been output-checked) can be manipulated into graphs etc.

The full write up.