lanl / dsi

LANL Data Science Infrastructure Project
https://lanl.github.io/dsi

Git-based provenance artifact #43

Open qwofford opened 1 year ago

qwofford commented 1 year ago

Dockerfiles provide provenance for how a container image was built: they are scripts executed to generate the image. A similar mechanism exists for run-time behavior, the CMD instruction, which sets the default command the container runs when it starts. That default can be overridden at launch to run custom commands. I propose a new kind of run-time provenance data structure with an intent similar to CMD, but with arbitrary flexibility. It fits within the DSI model (core/plugin/driver).

The Consumer Plugin

This plugin would be a child class of the SystemKernel Environment plugin, targeting a Charliecloud container run-time. The plugin should initialize with either a path to a file or a git remote. The input file should use a known syntax, similar to a Dockerfile. The first line of this Dockerfile-like file should identify the container by sha, so the plugin can validate that the requested container is available in storage, or acquire it if it is not. The remaining text in the file describes the run-time behavior, similar to CMD. Inside the plugin data structure, this file should be committed to a new or existing git repo. Alternatively, the plugin can be initialized with a git remote, from which it pulls down a similar repo. Once the git repo is init'd or downloaded, it should be stored as a key-val pair in the DSI Middleware, perhaps as a simple tar of the directory containing .git. This could be a column called "container_run_commands" or similar, whose value is either a reference to the git tarball or the tarball itself.
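A minimal sketch of the two mechanical pieces described above: parsing the Dockerfile-like file and packing the repo directory into a tarball. The `FROM <sha>` first-line syntax, the `RunSpec` type, and the function names are all assumptions made for illustration; the proposal does not fix a syntax, and none of this is existing DSI API. The git init/commit step is left out.

```python
import io
import tarfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunSpec:
    container_sha: str   # container reference taken from the first line
    commands: list       # remaining lines: the run-time behavior


def parse_run_file(text: str) -> RunSpec:
    """Parse a hypothetical Dockerfile-like run-time file."""
    lines = [ln.strip() for ln in text.splitlines()]
    # Ignore blank lines and '#' comments (an assumed convention).
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    if not lines or not lines[0].startswith("FROM "):
        raise ValueError("first line must be 'FROM <container sha>'")
    sha = lines[0].split(None, 1)[1]
    return RunSpec(container_sha=sha, commands=lines[1:])


def tar_repo(repo_dir: Path) -> bytes:
    """Pack the directory containing .git into an in-memory gzip tarball."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(str(repo_dir), arcname=repo_dir.name)
    return buf.getvalue()
```

The bytes returned by `tar_repo` are what could be stored under a key such as "container_run_commands"; storing a path to the tarball instead would be the by-reference variant.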

This plugin is most useful when paired with a Driver designed to interact with the git repo. The Driver would get and put the git-based run-time artifact to a back-end as usual. The artifact_handler method should set a git remote and pull from or push to that remote as specified by the user. The remote itself should not be stored in the back-end. The idea is not to make the user interact with git directly through DSI, but to give them a way to work somewhere more appropriate for highly interactive development, and to move that work in and out of DSI easily.
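One way the Driver side could look, as a rough sketch only. The class name, table layout, and the temporary remote name `dsi-sync` are hypothetical and not part of DSI; SQLite stands in for whatever back-end is used. The remote URL is passed per call and removed again after the sync, so it never lands in the stored artifact.

```python
import sqlite3


class GitArtifactStore:
    """Hypothetical driver: stores the repo tarball, never the remote."""

    def __init__(self, db_path=":memory:"):
        self.con = sqlite3.connect(db_path)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS artifacts "
            "(name TEXT PRIMARY KEY, tarball BLOB)"
        )

    def put_artifact(self, name, tarball):
        self.con.execute(
            "INSERT OR REPLACE INTO artifacts VALUES (?, ?)", (name, tarball)
        )
        self.con.commit()

    def get_artifact(self, name):
        row = self.con.execute(
            "SELECT tarball FROM artifacts WHERE name = ?", (name,)
        ).fetchone()
        if row is None:
            raise KeyError(name)
        return row[0]


def sync_commands(repo_dir, remote_url, direction, branch="main"):
    """Build (not run) the git commands for a one-off pull or push."""
    if direction not in ("pull", "push"):
        raise ValueError("direction must be 'pull' or 'push'")
    return [
        ["git", "-C", repo_dir, "remote", "add", "dsi-sync", remote_url],
        ["git", "-C", repo_dir, direction, "dsi-sync", branch],
        # Drop the remote again so it is not captured in the next tarball.
        ["git", "-C", repo_dir, "remote", "remove", "dsi-sync"],
    ]
```

Each command list could be passed to `subprocess.run(cmd, check=True)` after the tarball has been unpacked to `repo_dir`.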

The Producer Plugin

This plugin initializes with a git repo like the one described above, and a commit hash present in that git source tree. This plugin, when transloaded, will validate or pull the container by sha, and execute it with the run-time provenance file stored at the given commit hash. To be clear, the container sha refers to an image on the container registry, while the commit hash refers to a commit in the run-time provenance repo controlled by this plugin.
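To make the two references concrete, here is a small sketch that builds the commands involved without running them. `git show <commit>:<path>` retrieves the run-time file as it existed at the given commit, and `ch-run IMAGE -- COMMAND` is Charliecloud's launch form. The function names and the file name `Runfile` are assumptions for illustration.

```python
import re
import shlex


def show_run_file_cmd(repo_dir, commit, path="Runfile"):
    """git command that prints the run-time file as of `commit`."""
    # Accept abbreviated or full hex hashes (SHA-1 or SHA-256 repos).
    if not re.fullmatch(r"[0-9a-f]{7,64}", commit):
        raise ValueError(f"not a commit hash: {commit!r}")
    return ["git", "-C", repo_dir, "show", f"{commit}:{path}"]


def ch_run_cmd(image_dir, command):
    """Charliecloud command running one line from the run-time file."""
    return ["ch-run", image_dir, "--", *shlex.split(command)]
```

The container sha never appears in the git command, and the commit hash never appears in the `ch-run` command, which mirrors the separation described above.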

This plugin is a child of the SystemKernel plugin, so it will inherit all of the kernel metadata required to check run-time compatibility between the system configuration of previous runs and the current system config. It is not yet clear to me where this checking and error handling should take place. Maybe the metadata should just be recorded, to start. If it ends up being useful, we can add error handling.
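A sketch of the "just record it" starting point, assuming the metadata can be treated as a flat dictionary. The fields below come from Python's `platform.uname()` and are stand-ins; they are not necessarily what the SystemKernel plugin collects. The diff is reported rather than raised as an error.

```python
import platform


def kernel_metadata():
    """Snapshot a few kernel-level fields for the current system."""
    u = platform.uname()
    return {
        "system": u.system,
        "release": u.release,
        "version": u.version,
        "machine": u.machine,
    }


def diff_metadata(previous, current):
    """Return {field: (previous, current)} for every field that differs."""
    keys = sorted(set(previous) | set(current))
    return {
        k: (previous.get(k), current.get(k))
        for k in keys
        if previous.get(k) != current.get(k)
    }
```

An empty result from `diff_metadata` means the recorded and current configurations match on the compared fields; a later version could turn a non-empty result into a warning or an error.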

qwofford commented 1 year ago

Before we spend substantial time on this, we should evaluate this: https://github.com/GoogleCloudPlatform/ramble