[FR] The 'kubernetes' backend should allow users to use existing docker images

oreh commented 3 years ago


Willingness to contribute

The MLflow Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature (either as an MLflow Plugin or an enhancement to the MLflow code base)?

[x] Yes. I can contribute this feature independently.
[ ] Yes. I would be willing to contribute this feature with guidance from the MLflow community.
[ ] No. I cannot contribute this feature at this time.

Proposal Summary

The current 'kubernetes' backend requires users to rebuild and push a Docker image for every run. This makes it difficult to launch MLflow runs from a pod inside a Kubernetes cluster, because one normally cannot build Docker images inside a pod. We should give users an option to launch MLflow runs with existing Docker images, which could be built by other tools or processes.

Motivation

To my understanding, a new Docker image is built for each MLflow run in order to package the code and data in the working directory into the image. However:

- Ideally, this packaging should happen only once, when the source code or runtime dependencies change. We may run an experiment many times with different arguments, but those runs do not require new images.
- It is not always necessary to package everything into a new Docker image, since we normally have CI pipelines that build Docker images, and we distribute code via git and data via volumes in K8s.

Moreover, rebuilding and pushing images for each run is also a blocker to deploying the entire MLflow stack into a Kubernetes cluster. We prefer to develop MLflow projects in K8s pods and launch multiple runs directly with the 'kubernetes' backend. However, for security reasons we do not allow users to build and push Docker images inside a pod.

So it would be nice to give MLflow users a flag to skip building and pushing the Docker image when starting runs with the 'kubernetes' backend.

What component(s), interfaces, languages, and integrations does this feature affect?

Components

[ ] area/artifacts: Artifact stores and artifact logging
[ ] area/build: Build and test infrastructure for MLflow
[ ] area/docs: MLflow documentation pages
[ ] area/examples: Example code
[ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
[ ] area/models: MLmodel format, model serialization/deserialization, flavors
[x] area/projects: MLproject format, project running backends
[ ] area/scoring: Local serving, model deployment tools, spark UDFs
[ ] area/server-infra: MLflow server, JavaScript dev server
[ ] area/tracking: Tracking Service, tracking client APIs, autologging

Interfaces

[ ] area/uiux: Front-end, user experience, JavaScript, plotting
[ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
[ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
[ ] area/windows: Windows support

Languages

[ ] language/r: R APIs and clients
[ ] language/java: Java APIs and clients
[ ] language/new: Proposals for new client languages

Integrations

[ ] integrations/azure: Azure and Azure ML integrations
[ ] integrations/sagemaker: SageMaker integrations
[ ] integrations/databricks: Databricks integrations


oreh commented 3 years ago

I created a pull request to demonstrate how a flag can be used to bypass the Docker build: https://github.com/mlflow/mlflow/pull/3742

fg91 commented 3 years ago

> I created a pull request to demonstrate how to use a flag to bypass docker build. #3742

@oreh I tested your PR, but it does not work unless the exact image is specified in the kube config.

In PR #3987 I propose solving the same problem in a slightly different way: you do not have to specify in the kube context which image should be used. The image built on your development machine is automatically used for all other jobs started from within the first pod.

What do you think about this?

ShuxinLin commented 1 year ago

It seems this feature has not been released. I tested both PRs, https://github.com/mlflow/mlflow/pull/3742 and https://github.com/mlflow/mlflow/pull/3987, and both work well. I would love to see this feature merged into a release by the MLflow developers. Thanks!

DhavalRepo18 commented 1 year ago

@oreh @fg91 @ShuxinLin has tested it extensively and it works.
