Understand how researchers may generate graphs outside of the secure environment

lucyb commented 4 months ago

We need to document this process, so that researchers do the right thing (and not commit outputs to git).

This will also help to determine the scope of the initiative, because if researchers can't use released or published outputs without adding them to a git repo, it will stop them from using Codespaces later in the analysis pipeline (and tells us what one of the jumping off points is).

In this issue we should investigate how a researcher may use released outputs in a "local" analysis step to generate graphs for publication:

[x] Do we want to allow released outputs to be used or should they be published?
[x] What's the best way to access the outputs that are released to Job Server from within Codespaces? Can they be dragged and dropped? (Answer: yes, it's possible to drag and drop files, or use the upload option)
[x] How do they retrieve the generated graphs? (Answer: it's possible to download any file from Codespaces, including graphs and other outputs from running opensafely within the Codespace)

Jongmassey commented 4 months ago

Chatting to @alschaffer earlier, she does various bits of analysis as well as graph plotting outside the secure environment, too

Jongmassey commented 3 months ago

This is something I've thought about a lot and I don't have a good answer about what is the best approach

Considerations:

researcher ease
- submitting/waiting for jobs to complete
- applying/waiting for output checking of results
- fast feedback loops/more interactive approach
- greater range of libraries available
output checker ease
- minimise repetition for minor tweaks
- simple tabular data easier to check
- but also we don't want huge numbers of huge tables
IG concerns
- if we release something we shouldn't have - how easy is it to minimise its spread
- where are released but not approved outputs going?
maximise reproducibility
- Can I, as a researcher external to the project, run the code end-to-end
- Can I see the detail of how the analysis was done and the figures were produced

Jongmassey commented 3 months ago

With reproducibility in mind, we encourage researchers to commit all of their code to the study repository; and there is an argument for adding these post-release analysis/visualisation actions to the project.yaml. However, we do not particularly want the released-but-not-approved outputs in git (per the bad old days) as this is extremely challenging in case of needing to un-release something.

Users could drag-and-drop released outputs into a released_outputs folder which is added to the .gitignore - would this be acceptable to IG? We already have little knowledge/control of where these files go when downloaded from Job Server
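To make the released_outputs idea concrete, here is a minimal sketch of a helper that makes sure the folder is listed in .gitignore before anyone drops files into it. The function name `ensure_ignored` is made up for illustration and is not part of opensafely-cli; it only does an exact match on simple folder entries, so globs and negated patterns are ignored.

```python
from pathlib import Path


def _normalise(pattern: str) -> str:
    """Reduce a simple .gitignore entry to a bare folder name for comparison."""
    return pattern.strip().strip("/")


def ensure_ignored(repo_root: Path, folder: str = "released_outputs") -> bool:
    """Add `folder/` to .gitignore if absent; return True if a line was added.

    Assumption: only plain folder entries are compared (no glob handling).
    """
    gitignore = Path(repo_root) / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    entries = {
        _normalise(line)
        for line in existing
        if line.strip() and not line.lstrip().startswith("#")
    }
    if _normalise(folder) in entries:
        return False
    # Rewrite the file so the new entry always starts on its own line.
    lines = existing + [f"{_normalise(folder)}/"]
    gitignore.write_text("\n".join(lines) + "\n")
    return True
```

Something like this could run when a Codespace starts, so the folder is ignored before any released file lands in it. It does not stop someone force-adding a file with `git add -f`.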

Jongmassey commented 3 months ago

Mounting the released files (WebDAV or similar?) from Job Server in the codespace automagically would be very shiny and cool.

Jongmassey commented 3 months ago

Do we want to allow released outputs to be used or should they be published?

This is quite a high bar, the NHSE panel that approves published results only meets every two weeks and there's quite a bit of faff involved. Having a consistent policy re: released-but-not-published outputs and a set of recommendations/best practices would be preferable.

sebbacon commented 3 months ago

Regarding the reproducibility point, which I think is very important - we've talked before about a kind of "local extension" idea for project.yaml/opensafely-cli

Something like:

special label for actions which can run outside a secure environment
these could be run from a user's cli in the first instance, which would require:
- some kind of "sync outputs with job server" step (e.g. Jon's WebDAV suggestion, though this might cause issues if people or tools expect read/write access; or opensafely sync-outputs or similar)
- (potentially) special-casing in jobrunner for "local mode" which doesn't attempt to run "local mode" actions in TPP/EMIS backends (though it could).
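As a rough illustration of the "special label" idea, the sketch below splits the actions of an already-parsed project.yaml into those that must run in a secure backend and those marked as runnable locally. The `run_locally` key, the action names and the `run` values are all hypothetical; project.yaml has no such field today, and this is not how jobrunner currently works.

```python
from typing import Any

# Hypothetical label: not an existing project.yaml field.
LOCAL_KEY = "run_locally"


def split_actions(project: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Return (secure, local) action names based on the hypothetical label."""
    secure: list[str] = []
    local: list[str] = []
    for name, spec in project.get("actions", {}).items():
        # Actions without the label default to the secure environment.
        (local if spec.get(LOCAL_KEY, False) else secure).append(name)
    return secure, local


# Made-up project contents, shaped like a parsed project.yaml.
project = {
    "actions": {
        "extract_and_summarise": {"run": "<secure command>"},
        "plot_figures": {"run": "<plotting command>", LOCAL_KEY: True},
    }
}

secure_actions, local_actions = split_actions(project)
```

Defaulting unlabelled actions to the secure side means that forgetting the label can never cause an action to run outside the backend by accident.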

All this obviously needs a lot more careful thought, but wanted to throw it into the ring. I think I recall @bloodearnest having some thoughts around it.

lucyb commented 3 months ago

@sebbacon these are good suggestions, but this ticket is solely for understanding when/how researchers should use Codespaces (this might not have been evident from the description). We realise that researchers may want to do some work outside a secure environment and want to know if we can/should support that with Codespaces or whether we should be pointing them to do the work on their local machine. In particular, we're concerned about the storage and processing of released (but not published) outputs.

I think Jon's comment here about how researchers could use Codespaces with released files is the most pertinent:

Users could drag-and-drop released outputs into a released_outputs folder which is added to the .gitignore - would this be acceptable to IG? We already have little knowledge/control of where these files go when downloaded from Job Server

Jongmassey commented 3 months ago

opensafely sync-outputs with an auth pop-up to Job Server feels like a nice approach

bloodearnest commented 3 months ago

A few comments:

./output is already in the default .gitignore, for this reason
committing outputs to github repo is simply not an option, IMO. We've had to redact things that turned out to be disclosive a year after they were released.
having easy way to get the currently released outputs for a workspace as if they had been generated locally has a lot of UX benefits. This issue already mentions generating images from released data, but also general paper preparation (which we don't seem to have the same reproducibility concerns about)
having a way to authenticate from the cli opens up the mentioned sync-outputs, but also
- submitting jobs from the cli
- checking on status of jobs from the cli
- release files from the cli (option 2 below).
- better user telemetry

Broadly, when we've discussed this before, there were two approaches we considered:

1) submit graph generating jobs to run in a special non-sensitive backend

this backend has the currently released files in its local workspace
runs action as normal on the released outputs
automatically releases outputs to job-server, no output-checking

2) run graph generating jobs locally and release locally generated outputs to job-server

user gets latest outputs (e.g sync-outputs)
commits code and pushes commit to github repo.
runs job locally and generates outputs
user releases local outputs to job-server
- opensafely-cli validates that these jobs were generated by a commit id that is published to github repo.
- pushes outputs up to job-server, as if from "user" backend
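The commit validation step in option 2 could be approximated as below. This is only a sketch, not opensafely-cli's implementation: `commit_is_published` is a made-up name, and it relies on `git branch -r --contains`, which only reflects remote-tracking branches as of the last fetch.

```python
import subprocess
from typing import Callable, Sequence


def _git(args: Sequence[str]) -> str:
    """Run git and return stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def commit_is_published(
    commit: str, git: Callable[[Sequence[str]], str] = _git
) -> bool:
    """True if at least one remote-tracking branch contains `commit`."""
    try:
        out = git(["branch", "-r", "--contains", commit])
    except subprocess.CalledProcessError:
        # git exits non-zero when the commit is unknown to this repository.
        return False
    return any(line.strip() for line in out.splitlines())
```

The `git` parameter is there so the check can be exercised without a real repository. A real version would probably also need to confirm the working tree was clean when the outputs were generated, which this sketch does not attempt.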

I think Option 1) is a mostly RAP built solution, basically a whole new backend with some special job preparation and finalisation code.

FWIW: I am of the opinion we do not need the same level of reproducibility and audit trail for generating images from released data as we do for generating L4 data to be released.

where does that stop? Do papers need to have similar requirements?
we don't care about font tweaks or journal imposed style changes.
opensafely workflow is already slow enough.
nothing in the above requires users to do this, so I just think they won't. I bet many savvy users are already not doing this when they prepare papers.

lucyb commented 3 months ago

Some security considerations for the decision around allowing released outputs to be uploaded to a Codespace:

Only the creator of a Codespace can access that Codespace directly (thread)
It's possible to enable Live Share and publish the URL, giving someone full access to the Codespace.
Alternatively, it's possible to enable a web service and make it either organisation-accessible or to have a publicly accessible forwarded port (however, this uses a random URL, so would require that to be published somewhere). This could give others access to files if you're running, say, Jupyter in a Codespace.

This means that it requires active input from the creator of the Codespace in order for someone else to gain access to the files. In the case of port-forwarding, it's possible to configure that to start by default, but the URL would not be guessable and would require the creator to share it with others.

lucyb commented 3 months ago

From Seb in this thread:

The principle, then, seems to be:

It's OK for released outputs to be copied from the Job Server with the immediate team

It's OK for them to be released more widely (but not published) with close/senior scientific collaborators, subject to review within their own team (no formal approvals needed)

Only publication requires a formal signoff

However, we would want to make it very clear in the existing documentation with some examples of acceptable movement of outputs, presumably by adding some words in that Policies for Researchers doc. For example, we might say it's OK to copy outputs elsewhere for further processing, as long as the results of that processing are subject to the same controls.

Actions:

[ ] Seb to confirm with Amir whether it's permitted to store and process the released outputs outside of the UK.
[ ] If appropriate, agree updated wording in the Policies for Researchers doc.
[ ] Include some wording in the Codespaces section of the opensafely docs to point to the Policies for Researchers when describing how Codespaces might be used.
[ ] Check whether the new Codespaces page in the team manual needs updating (see ticket).

StevenMaude commented 3 months ago

Some security considerations for the decision around allowing released outputs to be uploaded to a Codespace …

I read the following today. It might not be important, but I'll flag it because it's quirky and unexpected behaviour.

From GitHub's documentation:

Once a user loses access to a codespace, the codespace is retained for a period of 7 days, then it is permanently deleted. During this 7-day period, to recover unpublished work from the codespace, the user must contact us through the GitHub Support portal.

I'm not sure what this support process entails. Can the removed user circumvent the organization in recovering data from an organization-owned codespace? You'd hope not, but I don't know.

I suspect, but haven't tested, that the hosting organization can still delete the codespace (because they'll be presumably incurring storage costs for it).

lucyb commented 2 months ago

I presented the question about the use of released outputs in a Codespace to Amir at an IG clinic on 1/5/24. He concluded that it is fine to allow researchers to use released outputs in a Codespace, with the following caveats around sharing:

As long as the codespace tooling meets the same sharing aspects of a webinar or email where results are shared in confidence, then it can be used as another method for sharing results in confidence, or equally a place where now anonymous results (because they have been output-checked) can be manipulated into graphs etc.

The full write up.

opensafely-core / codespaces-initiative

Understand how researchers may generate graphs outside of the secure environment #23