[FR] Install dependencies from AWS CodeArtifact

raxod502-plaid commented 3 months ago

Willingness to contribute

Yes. I can contribute this feature independently.

Proposal Summary

Currently, it is only possible to install Python dependencies from unauthenticated package registries, because there is no support for supplying ephemeral authentication credentials in the requirements.txt format supported by MLflow.

Several containers provided by AWS support a CA_REPOSITORY_ARN environment variable which, if provided, automatically triggers the dependency installer to authenticate to the supplied CodeArtifact repository and set it as the index URL before installing dependencies. Adopting the same standard for MLflow would be one option. In this case, other authenticated repositories could be supported by differently named environment variables. This would allow for maximum ease of use, but explicit support would be needed for each repository that somebody wanted to use.
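
As a rough illustration of the environment-variable option, here is a minimal sketch in Python. It assumes the ARN has the usual form `arn:aws:codeartifact:<region>:<account>:repository/<domain>/<repo>` and that a short-lived token has already been fetched elsewhere (for example with boto3's `get_authorization_token`). The function names `codeartifact_index_url` and `pip_env_for_install` are hypothetical, not existing MLflow APIs.

```python
import os
from typing import Dict, Optional


def codeartifact_index_url(repository_arn: str, auth_token: str) -> str:
    """Build a pip index URL from a CodeArtifact repository ARN and a token."""
    # Assumed shape: arn:aws:codeartifact:<region>:<account>:repository/<domain>/<repo>
    parts = repository_arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "codeartifact":
        raise ValueError(f"not a CodeArtifact ARN: {repository_arn!r}")
    region, account, resource = parts[3], parts[4], parts[5]
    resource_parts = resource.split("/")
    if len(resource_parts) != 3 or resource_parts[0] != "repository":
        raise ValueError(f"not a repository ARN: {repository_arn!r}")
    _, domain, repo = resource_parts
    return (
        f"https://aws:{auth_token}@{domain}-{account}"
        f".d.codeartifact.{region}.amazonaws.com/pypi/{repo}/simple/"
    )


def pip_env_for_install(auth_token: Optional[str] = None) -> Dict[str, str]:
    """Copy the environment and set PIP_INDEX_URL when CA_REPOSITORY_ARN is present."""
    env = dict(os.environ)
    arn = env.get("CA_REPOSITORY_ARN")
    if arn and auth_token:
        # pip reads PIP_INDEX_URL as the default for --index-url
        env["PIP_INDEX_URL"] = codeartifact_index_url(arn, auth_token)
    return env
```

The returned mapping could be passed as `env=` to whatever subprocess call runs `pip install`, so the token never has to be written into requirements.txt.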

An alternative implementation would be to allow for a user-provided hook to be run before package installation, for example a shell script at a specific position on the filesystem. Such a hook could perform whatever installation and Pip setup commands the user desires to be run. The advantage of this is it would be vendor-agnostic. On the other hand, the user would need to do more work regardless of which custom package registry they use.
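
The hook option could look roughly like the sketch below. The hook location `/opt/mlflow/pre-install-hook.sh` and the function name `run_pre_install_hook` are placeholders I made up for illustration; the proposal does not fix either.

```python
import os
import subprocess
from pathlib import Path

# Hypothetical default location; nothing in the proposal specifies a path.
DEFAULT_HOOK_PATH = Path("/opt/mlflow/pre-install-hook.sh")


def run_pre_install_hook(hook_path: Path = DEFAULT_HOOK_PATH) -> bool:
    """Run the user-provided hook if it exists; return True if it ran."""
    if not hook_path.is_file():
        # No hook configured: fall through to the normal installation.
        return False
    # check=True raises CalledProcessError if the hook exits non-zero,
    # which stops the installation before pip runs.
    subprocess.run(["sh", str(hook_path)], check=True, env=dict(os.environ))
    return True
```

One design consequence: the hook runs in a child process, so any environment variables it exports do not reach the installer. It would need to persist its setup, for example through `pip config set` or `aws codeartifact login`, for the subsequent `pip install` to pick it up.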

Motivation

What is the use case for this feature?

Installing dependencies of a model that are not available from a public package registry, or should be installed from an internal proxy that requires authentication.

Why is this use case valuable to support for MLflow users in general?

Currently, MLflow is realistically only compatible with unauthenticated Python package registries (or ones that allow for long-lived authentication tokens), which impedes the adoption of improved security and authentication postures for supply-chain security.

Why is this use case valuable to support for your project(s) or organization?

We're moving our internal Python package hosting from an unauthenticated registry to AWS CodeArtifact, which does require authentication.

Why is it currently difficult to achieve this use case?

There is currently no way to provide credentials for MLflow to use while installing packages, other than hardcoding them into requirements.txt, which does not work since it is not possible to obtain long-lived credentials for AWS CodeArtifact (the maximum is 12 hours).

Details

No response

What component(s) does this bug affect?

[ ] area/artifacts: Artifact stores and artifact logging
[ ] area/build: Build and test infrastructure for MLflow
[ ] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
[ ] area/docs: MLflow documentation pages
[ ] area/examples: Example code
[ ] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
[X] area/models: MLmodel format, model serialization/deserialization, flavors
[ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
[ ] area/projects: MLproject format, project running backends
[ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
[ ] area/server-infra: MLflow Tracking server backend
[ ] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

[ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
[ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
[ ] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
[ ] area/windows: Windows support

What language(s) does this bug affect?

[ ] language/r: R APIs and clients
[ ] language/java: Java APIs and clients
[ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

[ ] integrations/azure: Azure and Azure ML integrations
[ ] integrations/sagemaker: SageMaker integrations
[ ] integrations/databricks: Databricks integrations

B-Step62 commented 3 months ago

@raxod502-plaid Thank you for the proposal! Support for private/authenticated repositories makes a lot of sense to me.

One question about the solution: when and where do we expect users to configure authentication? As I understand it, to make AWS CodeArtifact work as a proxy index that pip install can use, users need to either run aws codeartifact login or use pip config to set it up. Does it make sense to assume users have already run it outside MLflow (so we only need to switch the destination), or should we also handle it within the dependency installation logic?

If it is the latter and some design choices are needed, it would be a good idea to write a quick two-pager using the OSS design template.
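
For reference, the two setup routes mentioned above can be sketched as argument lists that a caller would pass to `subprocess.run`. This is only an illustration: the helper names are hypothetical, and the domain, owner, and repository values are supplied by the caller.

```python
from typing import List


def codeartifact_login_command(
    domain: str, domain_owner: str, repository: str
) -> List[str]:
    """Argument list for `aws codeartifact login`, which configures pip itself."""
    return [
        "aws", "codeartifact", "login",
        "--tool", "pip",
        "--domain", domain,
        "--domain-owner", domain_owner,
        "--repository", repository,
    ]


def pip_config_command(index_url: str) -> List[str]:
    """Argument list for setting pip's index URL directly via `pip config`."""
    return ["pip", "config", "set", "global.index-url", index_url]
```

Either list would be run before dependency installation, for example with `subprocess.run(cmd, check=True)`.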

raxod502-plaid commented 3 months ago

I think unfortunately it is necessary to allow users to run the setup command within MLflow, if we want to support use cases like AWS SageMaker, where (to my knowledge) you are only able to provide a Docker image and it is just run as is with the model configuration mounted in - you don't have any control over the rest of the environment, other than setting env-vars.

I'll write a document following that template.

B-Step62 commented 3 months ago

@raxod502-plaid Makes sense, thank you for the clarification. Please let us know once the draft is ready, much appreciated.

raxod502-plaid commented 3 months ago

Here's a design proposal: https://docs.google.com/document/d/1M47mxxkDO7tkol9hVoxd3SMeMnfYtlyOda2BGQXVxnE/edit

github-actions[bot] commented 3 months ago

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.
