kubeflow / kubeflow

Machine Learning Toolkit for Kubernetes
https://www.kubeflow.org/
Apache License 2.0

How can you explicitly track the origin of code impounded into the Docker image Kubeflow expects for TFJob? #5098

Closed · espears1 closed this issue 3 years ago

espears1 commented 4 years ago

/kind question

Question:

A TFJob yaml config file names, among other things, the Docker image to use for executing that TFJob's commands. The standard approach is to package everything into that image, including all of the code that the TFJob commands will execute.
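For reference, a minimal sketch of what such a spec typically looks like; the job name, image name, and script path below are placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: example-training-job              # placeholder name
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              # The training code is baked into this image at build time.
              image: registry.example.com/team/trainer:latest
              command: ["python", "/app/train.py"]
```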

Naively this sounds great for provenance and tracking, but in practice it leaves a glaring gap in reproducibility. While it is true that "everything's in the container", so you can pull the same Docker image and manually reproduce the steps of the TFJob commands, you don't necessarily have any source control over the code that got impounded into the container during some Docker build step.

An alternative idea is that a TFJob should reference a specific commit from a source code repository, then clone that repository and mount the code into a Docker environment suitable for training (specifically, do not include the code as part of the image build). This way the TFJob yaml artifact records the precise container as well as the precise commit of the code, and any training arguments or other settings can be factored out as ENV vars in the TFJob yaml, or as specific settings inside command steps, like CLI args or variable exports.
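One way to express that clone-and-mount idea with plain Kubernetes primitives (nothing TFJob-specific) is an init container that checks out a pinned commit into a shared emptyDir volume. This is only a sketch; the repo URL, commit SHA, and images are placeholders:

```yaml
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      template:
        spec:
          volumes:
            - name: src
              emptyDir: {}
          initContainers:
            - name: fetch-code
              image: alpine/git                    # any small image with git
              command: ["sh", "-c"]
              # Pin the exact commit so the yaml itself records the code version.
              args:
                - git clone https://github.com/example/ml-repo.git /src &&
                  git -C /src checkout 0a1b2c3d
              volumeMounts:
                - { name: src, mountPath: /src }
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.3.0   # code-free base image
              command: ["python", "/src/train.py"]
              volumeMounts:
                - { name: src, mountPath: /src }
```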

Consider an example where I check out a copy of my team's source code repo and make a branch. I mess around with some ad hoc changes, but I want to test them in a full end-to-end Kubeflow run before committing anything. If I build a Docker image from this state of my local copy of the code, nobody else can check out that code and review it, check for errors, etc. Sure, they can pull my container and try to scrape the code out to see what is different and what my local changes are, but they can't easily generate a real version control diff or point to another location of the code (like the UI of a shared version control tool such as GitHub or Bitbucket).

In fact, having "everything in the container" isn't very valuable for reproducibility in that situation, because tracking is limited purely to the container itself: someone would have to manually compare the code in that container with a reference branch of source code (hopefully choosing a good reference branch) and use shell commands to generate diffs and so on, which is very much antithetical to the spirit of reproducibility.

What does Kubeflow recommend for a situation like this? How can you use Kubeflow in a way where "reproducibility of the ambient training environment" is conceptually separated from "reproducibility of the state of the code relative to version control at the moment you initiated training"?

The only two workarounds I can imagine so far are these:

  1. Never ever manually build containers for use in TFJob config; only allow containers that are "officially" built by a CI/CD pipeline. Then institute a strong convention that when someone creates a new model training algorithm, they must create it as a regular branch of the code, go through code review, and merge to a core branch that triggers CI/CD to build a container with it. Additionally, require that all aspects of model training customization are exposed as ENV vars or CLI args that can go entirely in the TFJob yaml file later on. After all this, you can write your TFJob yaml to point at the Docker image published by CI/CD containing your new model training code, and set up all parameters as ENV or embed them as part of commands. This way CI/CD is always responsible for what state of the code gets impounded into the Docker image, and lone developers absolutely never are (a sketch of this pattern follows this list).

  2. If lone developers should have the ability to create ad hoc Docker images for training, then the command section of the TFJob yaml should always start with some equivalent of git clone (or whatever VCS you are using): explicitly clone the source code, check out a particular commit or version or branch, do any installations required, and then execute the model training commands (also sketched after this list). This way the TFJob yaml fully contains the details of the commit of code used in the Docker image during training, and anyone else can check out that same commit. There could be edge cases where someone destroys or rewrites the history of the VCS and the commit / branch / version listed in the TFJob yaml no longer exists or got mutated, but hopefully that is rare, can be controlled with better VCS policies and practices, and doesn't need to be a concern of model training.
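To make the two workarounds concrete, here are hedged sketches of the container section each would produce; the registry, tags, repo URL, commit SHA, and hyperparameters are all placeholders:

```yaml
# Workaround 1: image built and pushed only by CI/CD, tagged with the commit SHA;
# every tunable is surfaced as ENV so the TFJob yaml is the definitive record.
containers:
  - name: tensorflow
    image: registry.example.com/team/trainer:0a1b2c3d   # immutable CI-published tag
    env:
      - { name: LEARNING_RATE, value: "0.001" }
      - { name: BATCH_SIZE, value: "64" }
    command: ["python", "/app/train.py"]
```

```yaml
# Workaround 2: generic base image (assumed to have git and pip available);
# the command itself clones and pins the commit, so the yaml records the exact code version.
containers:
  - name: tensorflow
    image: tensorflow/tensorflow:2.3.0
    command: ["sh", "-c"]
    args:
      - |
        git clone https://github.com/example/ml-repo.git /tmp/ml-repo
        cd /tmp/ml-repo && git checkout 0a1b2c3d
        pip install -r requirements.txt
        python train.py --learning-rate 0.001
```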

Option 1 has the downside that developers can't easily do ad hoc training submissions to Kubeflow and need to PR and merge training routines completely separately from the moment they execute them as jobs. But it has the huge upside that the source code permitted inside the Docker image at training time comes from one and only one source, the container published by CI/CD, which can always be mapped back to a specific commit in the repo.

Option 2 has the downside that developers have to copy/paste a bunch of "git clone" equivalent boilerplate to fetch a specific commit of code and install it into the environment, creating a lot of bloat in the TFJob yaml. But it has the upside that developers can do ad hoc Kubeflow training runs for tests or scratch experiments from a local branch, and only need to worry about committing and pushing their changes for the branch, without being required to go through code review, testing, and CI/CD builds before running a job.

I'm very curious what other people recommend or what other patterns are in use, because "just shove it all in the Docker image and publish it" won't work for real provenance or tracking: whatever exists inside the container cannot be mapped back to a version-controlled history of changes, and it creates a huge manual burden to scrape code out of these ad hoc Docker images if you need to diff it against a source repo, figure out where it diverged from another code version, or find where errors are.

issue-label-bot[bot] commented 4 years ago

Issue Label Bot is not confident enough to auto-label this issue. See dashboard for more details.


jlewi commented 4 years ago

This doesn't seem like an issue specific to Kubeflow or TFJob. I think it's a more general problem of tying your Docker images back to the source from which they were built.

The approaches I've seen are a combination of the following:

espears1 commented 4 years ago

@jlewi Thanks for your comment. I think the issue is different from items 1 or 3 in your reply. Those may enforce limitations on container choices or may allow user annotations (similar to the experiment metadata Kubeflow already lets a user annotate), but they don't actually enforce any tracking. If you could arrange for BinAuthz's notion of a trusted container to map 1-1 to a container built from a specified commit of a repo, then perhaps, but that seems quite different from its capabilities right now.

The middle item about forcing trusted containers with CI/CD is the same idea I mentioned in item 1 of my question. But it has the huge downside that round trips to submit training jobs are at least as slow as the CI/CD process, probably quite a bit slower, and you must be sure that the entire interface of customizable training options is exposed as ENV or CLI, so that the TFJob yaml serves as a definitive artifact for those choices. In theory you could commit them in the repo, but then any parameter change needs a full round trip of CI/CD plus a rebuild of the container.

The reason I ask this in the Kubeflow repo is that the connection of the TFJob yaml file to a specific commit is fundamental. Without that connection being a first-class citizen of the TFJob yaml, I don't think Kubeflow's entire execution model for experiment tracking can work.

A possibly better solution would be for Kubeflow to change the semantics a little bit. Instead of assuming everything that's needed to execute the training program is already inside the Docker image, it would be better to treat the Docker image like a canvas. It's a starting point, and then the TFJob yaml needs further instructions for how to take that starting point and actually put the correct version of the code on it.

This is why I brought up the idea of mounting the code from version control as a volume, and then executing installation or setup scripts that are listed within TFJob yaml files. Essentially right now Kubeflow takes this point of view:

Do whatever you gotta do to get me the Docker image, then I'll just execute your commands on that image.

which is a bad paradigm.

Instead it should be more like

Tell me the base Docker image to start with, then tell me how to mount data into it, then tell me how to mount code into it, then tell me any pre-training hooks for setup, then tell me any ENV or settings to apply, then tell me the training commands, then tell me any post-training hooks or teardown.

If TFJob yaml files contained those as a series of steps, all of the choices would be trackable.
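Purely as a thought experiment, and not an existing TFJob field set, a staged spec along those lines might read roughly like this; every field name below is invented for illustration:

```yaml
# Hypothetical schema, invented to illustrate the proposal above; not supported by TFJob today.
baseImage: tensorflow/tensorflow:2.3.0          # the "canvas"
dataMounts:
  - claimName: training-data
    mountPath: /data
codeMount:
  repo: https://github.com/example/ml-repo.git
  ref: 0a1b2c3d                                 # exact commit, recorded in the artifact
  mountPath: /src
preTrainingHooks:
  - pip install -r /src/requirements.txt
env:
  LEARNING_RATE: "0.001"
trainingCommand: python /src/train.py
postTrainingHooks:
  - python /src/export_metrics.py
```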

melihsqsp commented 4 years ago

@jlewi I came across the arena repo within the Kubeflow project. It might have some of the ideas mentioned here, such as the ability to use a base image for TensorFlow or other frameworks, plus a separate mount for user code which can be pulled from git or elsewhere, specified via the command args --syncMode and --syncSource. This capability could be very beneficial in production-type setups for easier and more reliable tracking of source. Is it possible to include the arena subrepo, or a similar capability, in the next Kubeflow 1.1 release with a better description so that it's widely available? Would it make sense to have this capability integrated directly into the tfjob/pytorch operator as an option so it can be supported more natively?
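Based on the flags mentioned above, an arena submission would look roughly like the following; the repo URL and script path are placeholders, and exact flag spellings and the synced-code location may vary by arena version:

```
arena submit tf \
  --name=tf-git-sync \
  --image=tensorflow/tensorflow:1.15.2 \
  --syncMode=git \
  --syncSource=https://github.com/example/ml-repo.git \
  "python code/ml-repo/train.py"   # path assumes arena's default sync location
```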

jbottum commented 3 years ago

@espears1 as you may know, Arrikto has provided some extensions to Kubeflow (i.e. Kale and Rok) that address some of the items you describe. If you would like to set up a conversation to discuss your requirements, please let me know.

jlewi commented 3 years ago

@melihsqsp my suggestion would be to talk to the @kubeflow/wg-training-leads about any changes to the job operators. Maybe consider opening issues in kubeflow/tf-operator and/or kubeflow/pytorch-operator.