✨ Build MLflow Tracking Server for MLOps Discovery

PriyaBasker23 commented 4 months ago

Describe the feature request.

Implement a fully managed MLflow tracking server on the AWS platform to help with the discovery of machine learning operations within MOJ.

Details:

Backend Store: Utilise Amazon RDS to store MLflow metadata and logs securely.
Artifact Storage: Use Amazon S3 for storage of machine learning models and artifacts.
Tracking Server: Deploy an EC2 instance or Docker container to host the MLflow tracking server, enabling remote access.

Other Requirements

Access to the tracking server should be secured through a login system.
Only authorized individuals should be able to access the experiments.
Data in the artifacts bucket should be organised into specific folders based on user identity. These folders should be accessible only to alpha users who have the necessary permissions. Folder-level permissions for alpha users can be managed from the Control Panel. This setup allows users to store artifacts generated from their code execution in Visual Studio in AP.
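A minimal sketch of what folder-level access could look like as an IAM policy, assuming one top-level folder per alpha user. The folder layout, the function name `user_prefix_policy`, and the example bucket and user names are illustrative assumptions, not an agreed design.

```python
import json


def user_prefix_policy(bucket: str, username: str) -> dict:
    """Build an IAM policy that limits one user to their own folder in a bucket.

    Assumption: artifacts live under a top-level folder named after the user.
    """
    prefix = f"{username}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, restricted to the user's prefix.
                "Sid": "ListOwnPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Object-level actions are granted only below the user's prefix.
                "Sid": "ReadWriteOwnObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and user names, for illustration only.
    policy = user_prefix_policy("example-artifacts-bucket", "alpha_user_example")
    print(json.dumps(policy, indent=2))
```

Splitting the statement in two reflects that `s3:ListBucket` applies to the bucket ARN while object actions apply to object ARNs.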

Detailed information is available at https://github.com/moj-analytical-services/mlops/blob/main/docs/mlflow/mlflow_tracking_server.md

Describe the context

MLflow Tracking is a component of the MLflow platform that enables data scientists and machine learning engineers to track and log experiments during the model development process. With MLflow Tracking, users can easily record parameters, metrics, and output files from their machine learning experiments, making it easier to organize and compare different approaches. It provides a centralized location to store experiment results, allowing for efficient collaboration and reproducibility. MLflow Tracking also offers a user-friendly interface for visualizing experiment results, enabling users to gain insights into model performance and make informed decisions about model improvements.

Value / Purpose

This configuration will enable data scientists to centralise their experimental data, streamlining access to experiments for all team members. It will also let data scientists integrate and test MLflow from their existing projects within Visual Studio, using the Application Platform.

User Types

Data Scientist

mshodge commented 4 months ago

Hi team, you might not be able to answer this right away, but for our own MLOps work and planning, it would be really good to know the timescales this might be deliverable over. Even knowing when you could start to explore it would help, whether that's days/weeks/months away. Thank you.

mshodge commented 4 months ago

Hi @Ed-Bajo could we set some timescales for this? I'm working with the Probation and Electronic Monitoring team and we'd like it to be available for testing and use soon. Is the end of June a feasible timescale to deliver to? Thanks. Michael

bcrawford-moj commented 4 months ago

This feature would be extremely useful for the BOLD AI for Linked Data team. We currently have no good way to track ML experiments and this would be a great step towards industry best practice. We'd like to see it as soon as is possible as we are a time limited programme.

jacobwoffenden commented 3 months ago

10/06/24 summary:

KMS keys, RDS PostgreSQL, S3 bucket, IAM policy, IAM role (IRSA enabled) and Kubernetes namespace created
- https://github.com/ministryofjustice/modernisation-platform-environments/pull/6510
https://artifacthub.io/packages/helm/community-charts/mlflow is 12 minor versions behind and not officially supported by MLflow
- Have started working on a very lightweight Helm chart
MLflow doesn't support anything other than basic-auth, there is currently no external IdP support
MLflow container runs as root and doesn't include Prometheus exporter package
- I have a working prototype

jacobwoffenden commented 3 months ago

11/06/24 summary:

2.13.2-rc0 released https://github.com/ministryofjustice/analytical-platform-mlflow/releases/tag/2.13.2-rc0
Testing shows MLflow cannot share the same database for both authentication and backend. That is fine, however we lack the ability to programmatically create databases in MP CI/CD, so I have created another RDS instance for now. Will explore an initContainer or schema migration tools such as Ariga's Atlas to do this

jacobwoffenden commented 3 months ago

12/06/24 summary:

https://mlflow.compute.development.analytical-platform.service.justice.gov.uk/ is running
Initial thoughts on management of permissions: it's quite cumbersome using the REST APIs and could really do with a wrapper (i.e. AP UI)
- https://mlflow.org/docs/2.13.2/auth/index.html#how-it-works
- https://mlflow.org/docs/2.13.2/rest-api.html
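To illustrate the kind of wrapper mentioned above, here is a small sketch that builds (but does not send) a request to grant one user a permission on one experiment. It assumes the endpoint path and field names described in the linked MLflow auth documentation (`experiments/permissions/create` with `experiment_id`, `username`, `permission`); the function name, credentials and IDs are hypothetical.

```python
import base64
import json
import urllib.request

# Permission levels as listed in the MLflow auth documentation.
VALID_PERMISSIONS = {"READ", "EDIT", "MANAGE", "NO_PERMISSIONS"}


def build_grant_request(base_url, admin_user, admin_password,
                        experiment_id, username, permission):
    """Build a POST request granting `username` a permission on an experiment.

    The request is only constructed here; pass it to urllib.request.urlopen
    to actually send it.
    """
    if permission not in VALID_PERMISSIONS:
        raise ValueError(f"unknown permission: {permission}")

    body = json.dumps({
        "experiment_id": str(experiment_id),
        "username": username,
        "permission": permission,
    }).encode("utf-8")

    # MLflow's built-in auth is HTTP basic auth, so credentials go in a header.
    token = base64.b64encode(
        f"{admin_user}:{admin_password}".encode("utf-8")
    ).decode("ascii")

    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/2.0/mlflow/experiments/permissions/create",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    req = build_grant_request(
        "https://mlflow.example.invalid", "admin", "secret",
        experiment_id=1, username="alpha_user_example", permission="READ",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

A UI wrapper would call something like this once per user and experiment, which is the repetitive part that makes the raw REST API cumbersome.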

jacobwoffenden commented 3 months ago

Moving to blocked while we discuss the way forward with Analytical Platform Product Management

jacobwoffenden commented 3 months ago

Notes:

Alpha users would need access to the S3 bucket to retrieve models. This in theory is OK, but we'd need to mutate an Alpha user's permissions based on what experiments and models they are allowed to access

mshodge commented 3 months ago

Solution one: users set their own artifact location when creating experiment

One solution is that users can define their own artifact location in MLFlow at the create experiment level (https://mlflow.org/docs/latest/rest-api.html#create-experiment), meaning they can direct artifacts to be stored in their own buckets anyway. I'm not sure how this works with access between MLFlow and that bucket, though. I will test this with the running server and see what error it gives.

Solution two: wrapper and AP control panel can be used to create experiments and assign S3 perms

There seems to be some circularity brewing with the process in that:

1. User gets access to MLFlow and has user permissions added
2. User creates an experiment and run using code and UI, which pushes model artifacts to an S3 bucket folder
3. User needs permissions to use the model artifacts from the S3 bucket outside of MLFlow

If step 1 can be done using their alpha user name, then we need a way of making sure that if they make a new experiment, it is linked back to their alpha user name for the S3 perms.

A solution might be to force users to use the AP Control Panel for creating experiments through the MLFlow API (https://mlflow.org/docs/latest/rest-api.html#create-experiment) instead of them creating them through code or the UI (although not sure how we really can prevent this :/), as then the API wrapper can also do the S3 perms at the same time at the artifact level.
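A rough sketch of what such a wrapper could compute in one step: the body for the MLflow create-experiment call (which accepts `name`, `artifact_location` and `tags`) plus the S3 prefix that would need permissions for the same alpha user. The per-user folder layout, the `alpha_user` tag key and the function name are assumptions for illustration.

```python
import json

CREATE_EXPERIMENT_PATH = "/api/2.0/mlflow/experiments/create"


def plan_experiment(alpha_username: str, experiment_name: str, bucket: str) -> dict:
    """Derive the create-experiment request and the S3 prefix to grant.

    Assumption: artifacts are stored under <bucket>/<alpha user>/<experiment>.
    """
    prefix = f"{alpha_username}/{experiment_name}/"
    return {
        "request_path": CREATE_EXPERIMENT_PATH,
        "request_body": {
            "name": experiment_name,
            # Pinning the location here ties the experiment to the user's folder.
            "artifact_location": f"s3://{bucket}/{prefix}",
            # Hypothetical tag so the owner can be traced back later.
            "tags": [{"key": "alpha_user", "value": alpha_username}],
        },
        # The same prefix is what the S3 permissions would be scoped to.
        "s3_prefix_to_grant": f"arn:aws:s3:::{bucket}/{prefix}*",
    }


if __name__ == "__main__":
    plan = plan_experiment(
        "alpha_user_example",                            # hypothetical user
        "churn-model",                                   # hypothetical experiment
        "alpha-analytical-platform-mlflow-development",  # bucket named in this thread
    )
    print(json.dumps(plan, indent=2))
```

Deriving both values from the same inputs is what keeps the experiment and the S3 permissions linked; it does not by itself stop users creating experiments through code or the UI.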

jacobwoffenden commented 3 months ago

@gfowler-moj is going to put a session in to review the way forward around authentication/permissions management

jacobwoffenden commented 3 months ago

Outcome of meeting:

Switch to using Alpha bucket for artefacts
Make Priya and Michael admins

jacobwoffenden commented 3 months ago

I have created a group in Control Panel (analytical-platform-mlflow-admins), added @mshodge and @PriyaBasker23, and created 3 artefact buckets:

alpha-analytical-platform-mlflow
alpha-analytical-platform-mlflow-development
alpha-analytical-platform-mlflow-test

jacobwoffenden commented 3 months ago

TODO:

[x] Update alpha bucket policies to allow APC roles
- [x] development
- [x] test
- [x] production
[x] Update MLFlow role to access alpha buckets
[x] Update s3_bucket_name in MLFlow values
[x] Drop schemas to reset MLFlow

jacobwoffenden commented 2 months ago

alpha-analytical-platform-mlflow-development updated with below JSON

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::alpha-analytical-platform-mlflow-development",
                "arn:aws:s3:::alpha-analytical-platform-mlflow-development/*"
            ],
            "Condition": {
                "Bool": {
                    "aws:SecureTransport": "false"
                }
            }
        },
        {
            "Sid": "AllowAnalyticalPlatformMLflow",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::381491960855:role/mlflow20240610161705974000000002"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::alpha-analytical-platform-mlflow-development",
                "arn:aws:s3:::alpha-analytical-platform-mlflow-development/*"
            ]
        }
    ]
}

MLflow is running again, but needs testing
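Since the same policy shape applies to each environment's bucket, it could be generated rather than hand-edited. A minimal sketch, assuming the structure of the JSON above; the function name is hypothetical and the role ARN for the other environments would need to be supplied.

```python
import json


def mlflow_bucket_policy(bucket: str, role_arn: str) -> dict:
    """Build a bucket policy matching the shape used for the development bucket."""
    resources = [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that is not made over TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": resources,
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Allow the MLflow role to use the bucket and its objects.
                "Sid": "AllowAnalyticalPlatformMLflow",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:*",
                "Resource": resources,
            },
        ],
    }


if __name__ == "__main__":
    # Values taken from the development policy above.
    policy = mlflow_bucket_policy(
        "alpha-analytical-platform-mlflow-development",
        "arn:aws:iam::381491960855:role/mlflow20240610161705974000000002",
    )
    print(json.dumps(policy, indent=4))
```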

jacobwoffenden commented 2 months ago

MLflow deployed to APC, follow on FR raised to create role for mutating permissions https://github.com/ministryofjustice/analytical-platform/issues/4593
