Run restricted dataset check on job server

Jongmassey commented 2 years ago

via @amirmehrkar :

restricted dataset checking now present in opensafely-cli and research-action
it may still be possible for a user to disregard the warnings from the cli or the action, or disable the action, and run a job which unauthorised use of restricted datasets (such as ICNARC).

My thoughts from a bit of investigation:

since job-runner doesn't use the opensafely cli it's not particularly viable to add this functionality by this route
I've successfully been able to query the github API to see whether a given commit has had the research action run to successful completion so this may be a viable approach

bloodearnest commented 2 years ago

The check would best be performed by job-server, and avoid submitting the job to a backend. job-runner has no permissions model and just does what job-server tells it.

That said, the same issue exists, but it's much easier to add opensafely-cli to job-server than it is to job-runner.

bloodearnest commented 2 years ago

Have moved this issue over to job-server, as per comment above

Jongmassey commented 2 years ago

Using the API something a bit like this: @bloodearnest @ghickman (obviously needs integration into the main job workflow etc, but here's my initial thoughts which seem to work)

https://github.com/opensafely-core/job-server/commit/7d74758db35fc9345a4ac18e4c95e3373054f172#diff-c62ad9e4fbb73f94642ba9e9b2fb4365c47fb4b0016e701fc9f772ca8b2b3a8c

ghickman commented 2 years ago

What are we trying to restrict with this? Access to the page? Running specific actions on the Run Jobs page?

Jongmassey commented 2 years ago

Sorry for not providing more context. As it stands the ICNARC dataset comes with conditions on its use such that only approved research projects may use it. The mabs/antivirals dataset that is inbound will have similar strings attached.

At the moment the checking is performed in opensafely check command of opensafely-cli, which is also run as part of opensafely run. A dataset check failure will also cause the research action to fail and create an issue on the offending repo tagging me, Amir and Simon.

A user could ignore this warning and kick off a job based on a commit that failed this check (or could disable the research action in their repo). We want to be able to prevent this.

The approach I suggested checks a given sha (which I see is a property of a JobRequest) to see if the most recent run of the research action was successful. Alternatively, the opensafely cli could be integrated into job server such that it runs opensafely check itself on a job request before progressing (which I think was Simon's suggestion).

ghickman commented 2 years ago

This makes a lot more sense, thank you!

Am I right in thinking that the research-action/opensafely check is checking the entire research repo for unauthorized dataset access, not just a particular action?

Jongmassey commented 2 years ago

It's just searching all .py files in the repository for lines that contain the function(s) which access the restricted dataset (and aren't commented out) using regular expressions.

ghickman commented 2 years ago

Aha, ok! We'll need to make this check before we render the Run Jobs page then. Unfortunately this means making changes to JobRequestCreate, the absolute worst view, so I think it's probably time for me to think about how we break that down.

I think there's possibly some simpler options for making the check. My main concern currently is that, at a minium, we make 3 blocking external requests, even from our server to GitHub that's a large overhead to pay. Do we need to show the user anything from the check metadata? I suspect we can start with "you've accessed a dataset you don't have permission to, for more information see the docs" with a link to the docs and maybe to the failing workflow too.

Some other options I'd like to take a look at:

Checks endpoints
A GraphQL query to try and get the data in one go

sebbacon commented 2 years ago

The correct way of doing this is to add per-dataset permissions metadata to our data model in job server, i.e. for Project A, users 1 and 2 have permissions to use datasets x, y and z. This would flow naturally from our "Contracts" work for databuilder. Ideally this info would be checked at runtime by cohortextractor to avoid, e.g., race conditions on permissions.

Obviously this isn't going to happen very soon. The question then becomes how urgent this is to fix.

There had been a decision some months from Amir + Ben that the current situation is "good enough". Before doing any more work on a short-term fix we know will be superseded, we should go back to them for prioritisation guidance

Jongmassey commented 2 years ago

Certainly, I just wanted to put the results of my early investigations somewhere so I wouldn't forget them/they'd be lost.

From conversations with Amir and Ben such as https://ebmdatalab.slack.com/archives/C31D62X5X/p1643202637303200?thread_ts=1643191472.293300&cid=C31D62X5X I'm getting the feeling that this may come up the priority list soon.

Jongmassey commented 2 years ago

@ghickman

A GraphQL query to try and get the data in one go

https://github.community/t/graphql-actions-api/14793 - last time I looked, Actions weren't available in the graph API (just to save you from that particular rabbit hole), hence why my back-of-the-envelope sketch used the REST API (couldn't figure out how to get all the required info in less than 3 calls).

bloodearnest commented 2 years ago

Hmm, so I'm not sure we should be checking actions. That feels brittle.

job-server checks out the project from github to grab the project.yaml I think. Why don't we re-run the simple check there?

ghickman commented 2 years ago

Ideally we wouldn't run the actual check in job-server:

we don't check out repos (we pull project.yamls via the API)
I'd prefer to avoid rolling opensafely-cli into job-server's requirements if we can help it
we want this check to run before we render the Run Jobs page and cloning then checking a repo is a significant overhead to that

There's two ways to ask the API for these details, via the Workflow API which Jon has spiked already, and via the Checks API, which is probably a better fit for our use case.

bloodearnest commented 2 years ago

Ah, my mistake, I thought we checked out already (fwiw I think we could maybe duplicate the check logic in job-server, or pull it into a different mutual dependency)

Requiring the right Check to pass feels like a suboptimal failure mode. If the Check fails, then you can't run your job. While that process is not without merit, it would be a rough transition for users, some of whom do not have their tests passing in GH actions as a baseline. Is the Checks API fine grained enough to only require only the opensafely check step passed, or is the whole workflow/action?

Also, if the check is a false negative for any reason (e.g. GH actions broken, or PyPI down, or we mess up an opensafely-cli release or the research action yaml update) then would we block all jobs from running?

ghickman commented 2 years ago

The Checks API gives you each check you see on a PR (that's what drives it!), so we can target any job defined in a workflow, the same way required checks work in a repo's config.

I'd have to dig into how the research-action works to get a better idea of it being fine-grained enough for us.

I think we have enough information here now to prioritise ideas when this ticket gets moved up the queue:

Can the Checks API (REST or GraphQL) and the existing workflow give us the information we need without blocking researchers unnecessarily?
What is required for job-server to start checking out code and running tests on it?

I'm happy to park this until we need to prioritise it.

bloodearnest commented 2 years ago

A last thought - this grepping of code is a temporary solution until we have contracts and associated permissions, so job-server checking out code should probably be avoided.

opensafely-core / job-server

Run restricted dataset check on job server #1543