mlflow / mlflow

Open source platform for the machine learning lifecycle
https://mlflow.org
Apache License 2.0

[FR] Spark Model Cache Replacement Policy #6256

Open · JohnFirth opened 2 years ago

JohnFirth commented 2 years ago

Willingness to contribute

Yes. I would be willing to contribute this feature with guidance from the MLflow community.

Proposal Summary

I'd like the ability to set a cache replacement policy for SparkModelCache, which currently has no policy. https://github.com/mlflow/mlflow/blob/9b83b355fc9c64ad1b51c66b1187eaab40d40d61/mlflow/pyfunc/spark_model_cache.py#L15

Motivation

What is the use case for this feature?

Performing batch inference with multiple models whose combined size would exhaust memory if they were all loaded at the same time.

Why is this use case valuable to support for MLflow users in general?

Others may wish to perform such an operation. I'm not sure how common the need is.

Why is this use case valuable to support for your project(s) or organization?

I'm currently performing batch inference with hundreds of models per Spark cluster, each of which can be up to 1 GB in size.

Why is it currently difficult to achieve this use case?

SparkModelCache has no replacement policy, so attempting the use case above could cause an out-of-memory error. https://github.com/mlflow/mlflow/blob/9b83b355fc9c64ad1b51c66b1187eaab40d40d61/mlflow/pyfunc/spark_model_cache.py#L15

Details

Perhaps this could be configured with an environment variable, but I'm not sure of the best approach. Happy to try to implement this feature with some guidance :)


WeichenXu123 commented 2 years ago

Do you mean a policy such as LRU?

WeichenXu123 commented 2 years ago

One issue is: how do we check how much memory a model uses? Once the memory threshold is exceeded, we can evict the model from the cache.

JohnFirth commented 2 years ago

Hey @WeichenXu123

> Do you mean a policy such as LRU?

Yeah, I think LRU would be suitable at least for my use case of multiple models, each being used one after the other.

> One issue is: how do we check how much memory a model uses? Once the memory threshold is exceeded, we can evict the model from the cache.

I think a simple upper limit on the number of models would be adequate, at least for me. (For my use case in fact, the limit could be 1.)
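
To illustrate, here's a rough sketch of the kind of thing I have in mind. The names (LRUModelCache, max_size, the loader callback) are placeholders, not anything that exists in MLflow today:

```python
from collections import OrderedDict
from threading import Lock


class LRUModelCache:
    """Sketch of a count-limited LRU cache; not the actual SparkModelCache."""

    def __init__(self, max_size=1):
        self._max_size = max_size
        self._lock = Lock()
        self._models = OrderedDict()  # archive_path -> loaded model

    def get_or_load(self, archive_path, loader):
        with self._lock:
            if archive_path in self._models:
                # Cache hit: mark as most recently used and return.
                self._models.move_to_end(archive_path)
                return self._models[archive_path]
        # Cache miss: load outside the lock, since loading can be slow.
        model = loader(archive_path)
        with self._lock:
            self._models[archive_path] = model
            self._models.move_to_end(archive_path)
            # Evict least-recently-used entries beyond the limit.
            while len(self._models) > self._max_size:
                self._models.popitem(last=False)
        return model
```

With max_size=1 the cache keeps only the most recently used model, which would be enough for my case.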

mlflow-automation commented 2 years ago

@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.

dbczumar commented 2 years ago

Hi @JohnFirth, apologies for the delay here. I think a configurable LRU cache would be great here, and we would be very excited about reviewing a PR with this feature, if you're still interested in contributing one. Please let me know if you have any questions.

JohnFirth commented 2 years ago

No worries @dbczumar :)

Happy to help, but I'm not quite sure how to set the cache size limit, tbh.

Perhaps SparkModelCache.get_or_load could receive a max_cache_size argument from spark_udf, which get_or_load would then use to enforce the limit?

WeichenXu123 commented 2 years ago

What about reading max_cache_size from an environment variable? You could define it in the mlflow/environment_variables.py module.
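
For example, something like this could sit alongside the other definitions (the variable name and helper are placeholders, not existing MLflow code):

```python
import os

# Placeholder name; nothing like this exists in MLflow yet. It could be declared
# in mlflow/environment_variables.py alongside the other variables.
_MAX_CACHE_SIZE_VAR = "MLFLOW_SPARK_MODEL_CACHE_MAX_SIZE"


def _get_max_cache_size(default=None):
    """Read the cache limit from the environment; None means no limit."""
    raw = os.environ.get(_MAX_CACHE_SIZE_VAR)
    return int(raw) if raw is not None else default
```

SparkModelCache.get_or_load could then consult this helper when deciding whether to evict, so spark_udf wouldn't need a new argument at all.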

JohnFirth commented 2 years ago

@WeichenXu123 yeah, ok — I'll see what I can do :)

mlflow-automation commented 2 years ago

@WeichenXu123 Please reply to comments.