opensafely-core / job-runner

A client for running jobs in an OpenSAFELY secure environment, requested via job-server (q.v.)

Abstract output handling from the executor. #425

Open bloodearnest opened 2 years ago

bloodearnest commented 2 years ago

Currently, executors must understand and implement matching output glob patterns, and handle unmatched patterns/outputs. An executor uses this primarily to know which files to copy to persistent storage, but the information is also communicated back to the job loop to give users better debugging information.

This logic is common to all executors, and should ideally be handled by job-runner's main job handling, rather than having every executor independently implement it.

One way we could do this is to split the finalize() API call into two, which I will tentatively call inspect() and persist().

Calling inspect() will create a JobResults object like finalize() does, with the appropriate exit code and message. But rather than outputs/unmatched_patterns/unmatched_outputs, it will have a single new_files property: a flat list of the new files generated by the action (as we already detect with the timestamp file).
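For concreteness, a rough sketch of the shape I have in mind (the real JobResults has more fields; the names here are illustrative only):

```python
from dataclasses import dataclass, field


@dataclass
class JobResults:
    """Illustrative shape of the result returned by the proposed inspect()."""

    exit_code: int
    message: str = ""
    # Flat list of workspace-relative paths created by the action, replacing
    # the current outputs/unmatched_patterns/unmatched_outputs fields.
    new_files: list[str] = field(default_factory=list)
```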

The job loop will then inspect these new files, applying the globs to the file paths and filtering out unmatched patterns and outputs. Then it will call persist() with two lists of files, one for high privacy, the other for medium privacy. The API implementation of persist() should (as sketched below):

a) validate that all the files exist, or else error
b) validate that the medium privacy file list is a subset of the high privacy file list
c) persist the files to their long term storage.
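A minimal sketch of the matching and of those persist() checks, assuming fnmatch-style patterns; all the names here (match_new_files, the output_spec shape, the privacy labels) are illustrative rather than the real executor API:

```python
import fnmatch


def match_new_files(new_files, output_spec):
    """Apply the action's glob patterns to the flat list returned by inspect().

    output_spec maps a privacy level (e.g. "highly_sensitive",
    "moderately_sensitive") to {output_name: glob_pattern}.
    """
    matched = {}  # path -> privacy level
    unmatched_patterns = []
    for privacy_level, patterns in output_spec.items():
        for _name, pattern in patterns.items():
            hits = fnmatch.filter(new_files, pattern)
            if not hits:
                unmatched_patterns.append(pattern)
            for path in hits:
                matched[path] = privacy_level
    unmatched_outputs = [path for path in new_files if path not in matched]
    return matched, unmatched_patterns, unmatched_outputs


def persist(job, high_privacy_files, medium_privacy_files, existing_files):
    """Sketch of checks (a) and (b); (c) is executor-specific."""
    # a) every file must actually exist in the job's volume, or else error
    missing = set(high_privacy_files) - set(existing_files)
    if missing:
        raise RuntimeError(f"files not found: {sorted(missing)}")
    # b) the medium privacy list must be a subset of the high privacy list
    if not set(medium_privacy_files) <= set(high_privacy_files):
        raise RuntimeError("medium privacy files must be a subset of high privacy files")
    # c) copy the files to their long term storage locations (executor-specific)
    ...
```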

This way, the executors only need to implement the "get newer files" logic, which is fairly simple, and don't need to care about output_spec or pattern matching.

bloodearnest commented 2 years ago

Note: we currently use the glob module for matching, which works against the file system. But under the hood, glob uses fnmatch, which works against strings, so this is a simple port.
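For example (a sketch of the string-based matching, not code from job-runner):

```python
import fnmatch

new_files = ["output/table_1.csv", "output/figure.png", "metadata/log.txt"]

# fnmatch matches glob-style patterns against plain strings; no filesystem access needed.
print(fnmatch.filter(new_files, "output/*.csv"))        # ['output/table_1.csv']
print(fnmatch.fnmatch("metadata/log.txt", "output/*"))  # False
```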

evansd commented 2 years ago

I've wondered about something like this before myself. I do like the idea of shifting as much logic out of the executor as possible. I think there are a few considerations against it. I'm not saying I think it's definitely a bad idea, just want to make sure we've thought it all through.

One issue is that I think we need to make any API changes with an eye to a future world where job-runner is simple(r) and stateless and all the coordination is handled by job-server. This API would involve shipping the full file list back out of the secure environment. Now, arguably that's fine because we're planning to ship the matched filenames back anyway and, given that we support glob patterns, the opportunity for steganographic data exfiltration via filenames is the same either way. But I'd like to make sure we've thought that through and are comfortable with it before we commit to this API.

Another pair of related issues:

(a) The "find unmatched files" feature was only intended as a useful local debugging aid, rather than something that ought to be implemented in production. As it happened, given that we used exactly the same code for both in the early days, we ended up getting the feature for free in production. But I'm not sure we need to continue supporting it there.

(b) The timestamp-file trick via which this is currently implemented feels like a massive hack to me. I was comfortable enough with it because I only intended it as a local debugging aid, but I'd be nervous about making it a key part of job execution. Maybe there's a more robust approach here, like listing the entire filesystem before and after the job runs and taking the diff (see the sketch below). I guess we can leave the implementation up to the executor, but I'd want to be sure there's at least one sensible way of doing it.
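For what it's worth, the "list the filesystem before and after" idea can be quite small. This is only a sketch, assuming that comparing (path, mtime) pairs is good enough; it is not something any executor currently does:

```python
import os


def snapshot(root):
    """Map each file under root to its mtime."""
    listing = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            listing[os.path.relpath(path, root)] = os.stat(path).st_mtime
    return listing


def new_or_modified(before, after):
    """Files that appeared, or whose mtime changed, between two snapshots."""
    return sorted(
        path for path, mtime in after.items()
        if path not in before or before[path] != mtime
    )


# before = snapshot(workspace_dir); <run the job>; after = snapshot(workspace_dir)
# new_files = new_or_modified(before, after)
```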

Finally,

> Note: we currently use the glob module for matching, which works against the file system

I don't think this is true. We convert the glob patterns to regexes and then use find -regex to do the matching. See: https://github.com/opensafely-core/job-runner/blob/58e206bf7b8583d8dc9deb61311dc9df0c37aa62/jobrunner/lib/docker.py#L217-L264

If we decide not to implement the thing you're suggesting then I think we should at the least change the executor API to accept regexes, so we've got flexibility to expand the very simplistic glob matching we currently support. Of course, that all becomes irrelevant if we use your approach.
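On the regex option: the glob-to-regex conversion the job loop would need is small anyway, e.g. via fnmatch.translate (an illustration only; the existing conversion in docker.py linked above is its own implementation):

```python
import fnmatch
import re

pattern = "output/*.csv"
regex = fnmatch.translate(pattern)  # a regex string, e.g. r'(?s:output/.*\.csv)\Z'
print(re.match(regex, "output/table_1.csv") is not None)  # True
```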

bloodearnest commented 2 years ago

Good point about sending the full file list up to the server. I actually like that transparency, as it would be useful for users to debug what files their jobs created. But it hinges on the question of what user-controlled data we can reasonably expose, which I think needs more discussion.

Regarding unmatched files, I think it may still prove useful for production debugging. I want to push for as much metadata being released from job-runner as we safely can, so I don't think it hurts. But it could perhaps be documented as optional in the executor API?

Regarding the timestamp, there are ways to ask for files newer than an arbitrary timestamp, but the timestamp file is fine from my POV.
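(For example, assuming GNU find is available in the image, something like this works without a marker file; the path and timestamp are placeholders, and this is just a sketch:)

```python
import subprocess

# List files modified after an arbitrary timestamp string, rather than
# comparing against a marker file with -newer. Assumes GNU find.
result = subprocess.run(
    ["find", "/workspace", "-type", "f", "-newermt", "2023-01-01 00:00:00"],
    capture_output=True, text=True, check=True,
)
new_files = result.stdout.splitlines()
```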

And about glob, yes, I'd forgotten the details. Either way, to match globs against strings we can use fnmatch, so it should be fine.