Provide job metadata for level 4 files in some way

bloodearnest commented 10 months ago

Currently, the only state release-hatch has access to is the Level 4 storage directory. As such, the only data it has is the files themselves, plus the legacy bits of per-directory data in metadata/metadata.json (basically just repo and workspace name). It might be a good idea to add the branch to that metadata, although that may not actually be necessary.

Specifically, it does not know what job created a particular file, or what user ran that job, or what commit of code was used. This is all useful information we want to be available as part of the release process, so release-hatch needs to be able to read (not write) it.

Note: the commit is actually required if we're going to provide a code browser.

All that data is in the job-runner sqlite db, which release-hatch by design doesn't have access to, as release-hatch is a network service exposed to level 4 users. We need to find a way to provide this metadata so release-hatch can make use of it. Ideally, this doesn't involve providing access to the job-runner sqlite db or otherwise compromising release-hatch's isolation from job-runner.

The obvious way would be to return to writing per-file metadata into the metadata/metadata.json file when we copy level 4 files into the medium_privacy directory. There may be alternative options.

Whichever architecture we decide on for the releases system, this metadata will need providing, so we can maybe start on it now.

bloodearnest commented 9 months ago

Regarding the obvious way to do it: we could return to having a per-file entry in metadata.json that records the job id, action name, and commit that generated that file, kinda like we used to do. Every time we copy a job's medium_sensitivity files to level 4, we'd update that metadata.json file.
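As a rough illustration of the per-file approach, here is a minimal sketch of what the update step could look like. The function name `update_file_metadata`, the `"files"` key, and the entry field names are all hypothetical, not existing job-runner code; the only thing taken from the discussion is that each entry records job id, action name, and commit.

```python
import json
import os
import tempfile
from pathlib import Path


def update_file_metadata(metadata_path, filenames, job_id, action, commit):
    """Merge per-file entries for one job into metadata.json (hypothetical helper)."""
    metadata_path = Path(metadata_path)
    # Keep whatever is already there (e.g. repo and workspace name).
    if metadata_path.exists():
        metadata = json.loads(metadata_path.read_text())
    else:
        metadata = {}

    files = metadata.setdefault("files", {})
    for name in filenames:
        # A later job that regenerates the same file overwrites its entry.
        files[name] = {"job_id": job_id, "action": action, "commit": commit}

    metadata_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the
    # original, so a reader never sees a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=metadata_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(metadata, f, indent=2)
    os.replace(tmp_name, metadata_path)
    return metadata
```

The temp-file-then-rename step is a design choice rather than a requirement: it assumes release-hatch may read the file while job-runner is updating it.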

Note that eventually, we plan to send the filename information up to job-server, for various reasons. If we do that, this work will be unnecessary. As such, it may only be a temporary solution, so we maybe shouldn't worry too much about the implementation.

bloodearnest commented 6 months ago

I think it would be good to capture lots of information for each L4 file in the manifest.json.

So, the files JSON might look like:

"files": {
    "output/foo.txt": {
        "job_id": ...,
        "job_request": ...,
        "action_name": ...,
        "user": ...,
        "size": 1234567789,
        "timestamp": 123456789,  # time copied to L4
        "content_hash": ...,
    },
    ...
}

This gives us a wealth of information. In particular, it means we don't need to stat() or hash every file to get this info; we can just load the manifest.json once.
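To make the trade-off concrete, here is a minimal sketch of both sides: building one entry at copy time, and reading all entries back with a single load. The helper names `build_file_entry` and `load_file_entries` are hypothetical, and SHA-256 is an assumption, since the hash algorithm isn't specified above.

```python
import hashlib
import json
import time
from pathlib import Path


def build_file_entry(path, job_id, job_request, action_name, user):
    """Build one "files" entry at copy time (hypothetical helper)."""
    data = Path(path).read_bytes()
    return {
        "job_id": job_id,
        "job_request": job_request,
        "action_name": action_name,
        "user": user,
        "size": len(data),
        "timestamp": int(time.time()),  # time copied to L4
        # Assumption: SHA-256 hex digest; the thread does not name an algorithm.
        "content_hash": hashlib.sha256(data).hexdigest(),
    }


def load_file_entries(manifest_path):
    """Read every per-file entry with one load, no stat() or hashing needed."""
    manifest = json.loads(Path(manifest_path).read_text())
    return manifest.get("files", {})
```

The cost of hashing is paid once, by whatever copies the file to L4; a reader such as release-hatch would then only need something like `load_file_entries(...)["output/foo.txt"]["size"]`.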

bloodearnest commented 6 months ago

Update: I am currently running the backfill on level 4, which is why this is not yet complete.

It should be done by Monday.

bloodearnest commented 6 months ago

This is done; the backfill is complete.

opensafely-core / job-runner

Provide job metadata for level 4 files in some way #701