ministryofjustice / analytical-platform

Analytical Platform • This repository is defined and managed in Terraform
https://docs.analytical-platform.service.justice.gov.uk
MIT License

✨ Build MLflow Tracking Server for MLOps Discovery #4275

Closed PriyaBasker23 closed 2 months ago

PriyaBasker23 commented 4 months ago

Describe the feature request.

Implement a fully managed MLflow tracking server on AWS to support the discovery of machine learning operations (MLOps) within MOJ.

Details:

Backend Store: Utilise Amazon RDS to store MLflow metadata and logs securely.
Artifact Storage: Use Amazon S3 for storage of machine learning models and artifacts.
Tracking Server: Deploy an EC2 instance or Docker container to host the MLflow tracking server, enabling remote access.
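As a rough illustration of how these three pieces fit together, the sketch below builds the `mlflow server` command line from a backend store URI and an artifact root. The database engine (PostgreSQL), host name, credentials placeholder, and bucket name are all assumptions made for the example; the request itself only specifies Amazon RDS and Amazon S3.

```python
import shlex


def mlflow_server_command(backend_store_uri, artifact_root, host="0.0.0.0", port=5000):
    """Build the argument list for launching `mlflow server`.

    backend_store_uri: SQLAlchemy-style URI of the metadata database (RDS here).
    artifact_root: default location for experiment artifacts (an S3 URI here).
    """
    return [
        "mlflow", "server",
        "--backend-store-uri", backend_store_uri,
        "--default-artifact-root", artifact_root,
        "--host", host,
        "--port", str(port),
    ]


# Hypothetical values, not taken from the issue: replace with the real
# RDS endpoint, credentials, and artifact bucket.
command = mlflow_server_command(
    backend_store_uri="postgresql://mlflow:CHANGE_ME@example-rds-host:5432/mlflow",
    artifact_root="s3://example-mlflow-artifacts/",
)
print(shlex.join(command))
```

The same argument list could be used as the entrypoint of a Docker container or passed to `subprocess.run` on an EC2 instance; which of the two hosting options is used does not change the flags.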


Other Requirements

Detailed information is available at https://github.com/moj-analytical-services/mlops/blob/main/docs/mlflow/mlflow_tracking_server.md

Describe the context

MLflow Tracking is a component of the MLflow platform that enables data scientists and machine learning engineers to track and log experiments during the model development process. With MLflow Tracking, users can easily record parameters, metrics, and output files from their machine learning experiments, making it easier to organize and compare different approaches. It provides a centralized location to store experiment results, allowing for efficient collaboration and reproducibility. MLflow Tracking also offers a user-friendly interface for visualizing experiment results, enabling users to gain insights into model performance and make informed decisions about model improvements.

Value / Purpose

This configuration will enable data scientists to centralise their experimental data, streamlining access to experiments for all team members. It will also let data scientists integrate and test MLflow from their existing projects within Visual Studio, using the Application Platform.

User Types

Data Scientist

mshodge commented 4 months ago

Hi team, you might not be able to answer this right away, but for our own MLOps work and planning, it would be really good to know the timescales over which this might be deliverable, or even when you could start to explore it, whether that's days/weeks/months away. Thank you.

mshodge commented 4 months ago

Hi @Ed-Bajo could we set some timescales for this? I'm working with the Probation and Electronic Monitoring team and we'd like it to be available for testing and use soon. Is the end of June a feasible timescale to deliver to? Thanks. Michael

bcrawford-moj commented 4 months ago

This feature would be extremely useful for the BOLD AI for Linked Data team. We currently have no good way to track ML experiments and this would be a great step towards industry best practice. We'd like to see it as soon as possible, as we are a time-limited programme.

jacobwoffenden commented 3 months ago

10/06/24 summary:

jacobwoffenden commented 3 months ago

11/06/24 summary:

jacobwoffenden commented 3 months ago

12/06/24 summary:

jacobwoffenden commented 3 months ago

Moving to blocked while we discuss the way forward with Analytical Platform Product Management

jacobwoffenden commented 3 months ago

Notes:

mshodge commented 3 months ago

Solution one: users set their own artifact location when creating experiment

One solution is that users can define their own artifact location in MLflow when they create an experiment (https://mlflow.org/docs/latest/rest-api.html#create-experiment), meaning they can direct artifacts to be stored in their own buckets anyway. I'm not sure how this works with access between MLflow and that bucket, though. I will test this with the running server and see what error it gives.
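For reference, a minimal sketch of what solution one looks like against the REST API linked above: the create-experiment endpoint accepts an `artifact_location` alongside the experiment `name`. The tracking server URL, experiment name, and bucket path below are hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def build_create_experiment_request(tracking_uri, name, artifact_location):
    """Build (but do not send) a create-experiment call with a custom artifact location."""
    body = json.dumps({"name": name, "artifact_location": artifact_location})
    return urllib.request.Request(
        url=tracking_uri.rstrip("/") + "/api/2.0/mlflow/experiments/create",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server URL, experiment name, and user-owned bucket.
request = build_create_experiment_request(
    tracking_uri="https://mlflow.example.invalid",
    name="example-experiment",
    artifact_location="s3://example-team-bucket/mlflow/example-experiment",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(request)
```

Creating the experiment this way only records where artifacts should go. Whether uploads then succeed depends on the S3 permissions of whichever process writes the artifacts, which is the open question raised above.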

Solution two: wrapper and AP control panel can be used to create experiments and assign S3 perms

There seems to be some circularity brewing with the process in that:

  1. User gets access to MLflow and has user permissions added
  2. User creates an experiment and run using code and UI, which pushes model artifacts to an S3 bucket folder
  3. User needs permissions to use the model artifacts from the S3 bucket outside of MLflow

If step 1 can somehow be done using their alpha user name, then we need a way of making sure that any new experiment they make is linked back to their alpha user name for the S3 perms.

A solution might be to force users to use the AP Control Panel for creating experiments through the MLflow API (https://mlflow.org/docs/latest/rest-api.html#create-experiment) instead of creating them through code or the UI (although not sure how we can really prevent this :/), as then the API wrapper can also set the S3 perms at the artifact level at the same time.
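To make the "S3 perms at the artifact level" part more concrete, here is a hedged sketch of the kind of IAM policy document such a wrapper could generate for one experiment's artifact location. The function name, statement IDs, and the choice of read-only actions are assumptions for illustration; how the policy would be attached to a user's role is not covered.

```python
import json
from urllib.parse import urlparse


def artifact_read_policy(artifact_location):
    """Return an IAM policy document granting read access to one artifact prefix."""
    parsed = urlparse(artifact_location)
    bucket = parsed.netloc
    prefix = parsed.path.strip("/")
    if parsed.scheme != "s3" or not bucket or not prefix:
        raise ValueError(f"expected s3://bucket/prefix, got {artifact_location!r}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is narrowed by prefix.
                "Sid": "ListExperimentArtifacts",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object reads are limited to keys under the experiment prefix.
                "Sid": "ReadExperimentArtifacts",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Hypothetical artifact location for a single experiment.
policy = artifact_read_policy("s3://example-mlflow-artifacts/experiments/42")
print(json.dumps(policy, indent=4))
```

Because the policy is derived from the artifact location alone, the wrapper could create the experiment and grant access in one step, which is what breaks the circularity described in the numbered list.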

jacobwoffenden commented 3 months ago

@gfowler-moj is going to put a session in to review the way forward around authentication/permissions management

jacobwoffenden commented 3 months ago

Outcome of meeting:

jacobwoffenden commented 3 months ago

I have created a group in Control Panel (analytical-platform-mlflow-admins), added @mshodge and @PriyaBasker23, and created 3 artefact buckets:

jacobwoffenden commented 3 months ago

TODO:

jacobwoffenden commented 2 months ago

alpha-analytical-platform-mlflow-development updated with below JSON

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::alpha-analytical-platform-mlflow-development",
                "arn:aws:s3:::alpha-analytical-platform-mlflow-development/*"
            ],
            "Condition": {
                "Bool": {
                    "aws:SecureTransport": "false"
                }
            }
        },
        {
            "Sid": "AllowAnalyticalPlatformMLflow",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::381491960855:role/mlflow20240610161705974000000002"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::alpha-analytical-platform-mlflow-development",
                "arn:aws:s3:::alpha-analytical-platform-mlflow-development/*"
            ]
        }
    ]
}

MLflow is running again, but needs testing
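Since the same policy shape presumably applies to each artefact bucket, a small generator may be handy. The sketch below reproduces the JSON above from a bucket name and role ARN; the function name is hypothetical, and only the development bucket named in this comment is used.

```python
import json


def mlflow_bucket_policy(bucket_name, mlflow_role_arn):
    """Build the bucket policy shown above for a given bucket and MLflow role."""
    resources = [
        f"arn:aws:s3:::{bucket_name}",
        f"arn:aws:s3:::{bucket_name}/*",
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that does not use TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": list(resources),
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Give the MLflow role full S3 access to this bucket only.
                "Sid": "AllowAnalyticalPlatformMLflow",
                "Effect": "Allow",
                "Principal": {"AWS": mlflow_role_arn},
                "Action": "s3:*",
                "Resource": list(resources),
            },
        ],
    }


policy = mlflow_bucket_policy(
    bucket_name="alpha-analytical-platform-mlflow-development",
    mlflow_role_arn="arn:aws:iam::381491960855:role/mlflow20240610161705974000000002",
)
print(json.dumps(policy, indent=4))
```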

jacobwoffenden commented 2 months ago

MLflow deployed to APC; follow-on FR raised to create a role for mutating permissions: https://github.com/ministryofjustice/analytical-platform/issues/4593