[FR] - model_version_tags indexing

franco-bocci commented 4 months ago

Willingness to contribute

Yes. I can contribute this feature independently.

Proposal Summary

The table model_version_tags could benefit from an index for both the name and version field. Filtering this table for name and version increases CPU usage by a lot. Currently, the table has a PK for key, name and version but not an index for those fields.https://github.com/mlflow/mlflow/blob/master/mlflow/store/model_registry/dbmodels/models.py#L144

Motivation

What is the use case for this feature?

Improve DB performance when reading from model_version_tags

Why is this use case valuable to support for MLflow users in general?

Response time would be improved and CPU usage of the DB store could be reduced, making it more cost-efficient.

Why is this use case valuable to support for your project(s) or organization?

Our team gets a spike on the DB usage at a specific point in time caused by this query run by many projects at the same time. Instead of scaling up the database, we can contribute this if you agree it would be a good improvement.

Why is it currently difficult to achieve this use case?

I think it is not difficult.

Details

No response

What component(s) does this bug affect?

[ ] area/artifacts: Artifact stores and artifact logging
[ ] area/build: Build and test infrastructure for MLflow
[ ] area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
[ ] area/docs: MLflow documentation pages
[ ] area/examples: Example code
[X] area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
[ ] area/models: MLmodel format, model serialization/deserialization, flavors
[ ] area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
[ ] area/projects: MLproject format, project running backends
[ ] area/scoring: MLflow Model server, model deployment tools, Spark UDFs
[ ] area/server-infra: MLflow Tracking server backend
[ ] area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

[ ] area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
[ ] area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
[X] area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
[ ] area/windows: Windows support

What language(s) does this bug affect?

[ ] language/r: R APIs and clients
[ ] language/java: Java APIs and clients
[ ] language/new: Proposals for new client languages

What integration(s) does this bug affect?

[ ] integrations/azure: Azure and Azure ML integrations
[ ] integrations/sagemaker: SageMaker integrations
[ ] integrations/databricks: Databricks integrations

B-Step62 commented 4 months ago

Thank you for proposing this feature, @franco-bocci ! Adding index sounds valuable for the large scale use case of the model registry. However, one concern is the migration effort as it is basically the schema change. Since many users are running the model registry in production, the migration needs to be done carefully with an easy and safe migration script/tool.

franco-bocci commented 4 months ago

Thank you for proposing this feature, @franco-bocci ! Adding index sounds valuable for the large scale use case of the model registry. However, one concern is the migration effort as it is basically the schema change. Since many users are running the model registry in production, the migration needs to be done carefully with an easy and safe migration script/tool.

Hey! Yes, I think we could add this to the alembic migrations so that the new version includes it. What do you think?

B-Step62 commented 4 months ago

Yes I think that works! Before actually start working on it, would you mind running a quick benchmark in your env with index/no-index, so we can get overall sense of the impact of indexing? It is non-trivial effort to update the schema, so we just want to make sure we can get certain performance improvement for common use cases🙂 If I can be more greedy, it would be super nice if the benchmark includes the small-medium database size, as the registry DB size is not so large for common use cases. You can also refer to the thread that we've added index to tracking database: https://github.com/mlflow/mlflow/issues/3785

franco-bocci commented 4 months ago

Hey! Sure. From my side, this are the next steps: 1) add this index to our DB 2) compare CPU usage before and after adding the index 3) sharing the results here

After that, if it's okay for you, I can work on adding this index through Alembic. Probably will be done during this week. Works for you?

github-actions[bot] commented 4 months ago

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

B-Step62 commented 3 months ago

@franco-bocci Yes that sounds like a plan! Please let us know when you get the benchmark (no time pressure!). Thank you so much for your willingness for contribution🙂

franco-bocci commented 2 months ago

Hey! Apologies for the long delay. It was a simple change, but as DB migrations can go wrong, this went stale as I had to focus on other things. We were having a CPU spike on the DB side every day at 2 a.m.from a query performing this SELECT filtering by model_version_tags.name AND model_version_tags.version. One client of the service was performing update for multiple I think models at that point in time (not sure exactly whether they were updating models).

The CPU usage went from > 90% to 30% after applying the index. No other changes done. For the index creation, we executed:

CREATE INDEX IF NOT EXISTS model_version_tags_name_and_version
ON model_version_tags (name, version);

Hope this helps as a reference.

mlflow / mlflow