tensorflow / serving

A flexible, high-performance serving system for machine learning models
https://www.tensorflow.org/serving
Apache License 2.0

Lazily load models and on insufficient resources for a load, look to unload idle models #1403

Open kimjuny opened 5 years ago

kimjuny commented 5 years ago

Feature Request

Describe the problem the feature is intended to solve

I have thousands of models to serve, but a large portion of them are not frequently requested; in fact only a few of them are. Loading all of these models into memory at the same time is quite resource- and time-consuming.

Describe the solution

I wonder if there's an option to lazy-load models and keep only the most frequently requested ones cached in memory?

Describe alternatives you've considered

None yet.

Additional context

Actually I'm not sure whether this is a feature request or whether this feature already exists.

rmothukuru commented 5 years ago

Can you please check this link about Lazy Loading (Warm Starting) and let us know if it helps.

kimjuny commented 5 years ago

> Can you please check this link about Lazy Loading (Warm Starting) and let us know if it helps.

I didn't set up any "warmup"; I'm using the tf.keras API.

I'm saving the model with the tf.keras.experimental.export_saved_model(model, saved_model_path) API (TF 2.0.0-beta1).

I think the link above describes something that reduces first-request latency at the cost of more memory at init (the model loading stage), which is kind of the opposite of the feature I need?

rmothukuru commented 5 years ago

@kimjuny, sorry for the misunderstanding. Can you please elaborate on your question so that we can provide better support?

kimjuny commented 5 years ago

@rmothukuru I have lots of models to serve (tens of thousands of SavedModels), and I've configured them all in the Serving Config. When I start tensorflow/serving, it consumes too much time and memory at startup. So I wonder if there is a startup option that disables loading all the models specified in the Serving Config at the beginning, and instead loads each specific model later, when it is first requested.
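For illustration, the Serving Config in question is just a model_config_list with one entry per model, repeated thousands of times (the model names and paths below are made up):

```
model_config_list {
  config {
    name: "stock_000001"  # made-up model name
    base_path: "/models/stock_000001"
    model_platform: "tensorflow"
  }
  config {
    name: "stock_000002"
    base_path: "/models/stock_000002"
    model_platform: "tensorflow"
  }
  # ...thousands more entries like these
}
```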

misterpeddy commented 5 years ago

Hey @kimjuny, the use case definitely makes sense. There's nothing on the roadmap currently addressing this need, but I'll leave this open to see if other folks develop the same need. Can you elaborate on the use case? For example, why do you have O(10K) models that can't or shouldn't be combined? And how would you know you need to load any specific one? Are they models trained for specific users, so you won't know you need one of them loaded until that user sends an inference request? This kind of information will help us understand, for example, what the model load time SLO (sub-second, or are minutes fine?) would need to be for such a feature to be useful to you.

kimjuny commented 5 years ago

@unclepeddy Thanks for your reply. I'm currently working on a market prediction project. In my case I have to train models individually (one per stock) for better performance, with different versions of each. I only run inference when the market is closed, so I actually have plenty of time, which means a few seconds of latency (for each model's loading stage) is totally acceptable for me. So here are two features (I assume) that could really help me:

1. Lazy-load option: when something like -e LAZY_LOAD=true is set when starting tensorflow/serving, the server would start immediately without loading any of the models defined in the Serving Config. Each model specified in the Serving Config would only be loaded into memory when an inference request for it arrives.

2. Memory management, e.g. FIFO, LRU, ...: when -e LAZY_LOAD=true is set, tensorflow/serving would manage memory automatically according to an option (something like -e FIFO_TOTAL=500, meaning load at most 500 models and evict on a first-in-first-out basis).

Both of these features would be a huge enhancement for my case (a hypothetical invocation is sketched below): the first would really speed up the startup stage, and the second would save a lot of memory while the server is running.
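For illustration only, a startup along these lines might look like the command below. Note that LAZY_LOAD and FIFO_TOTAL are the hypothetical options proposed above and do not exist in tensorflow/serving today:

```
# Hypothetical invocation: LAZY_LOAD and FIFO_TOTAL are proposed options, not real flags.
docker run -p 8500:8500 -p 8501:8501 \
  -v /path/to/models:/models \
  -v /path/to/models.config:/config/models.config \
  -e LAZY_LOAD=true \
  -e FIFO_TOTAL=500 \
  tensorflow/serving --model_config_file=/config/models.config
```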

misterpeddy commented 5 years ago

Automatically unloading models subject to some policy (LRU as you mention, or after a set period of inactivity) is something we've thought about, but there are no concrete plans to add it to model server. Internally we implement this logic outside of model server, which is something you could do as well: i.e. a simple endpoint that your clients call, which coordinates with model server to ensure the model an inference request needs is loaded (and if the server doesn't have enough memory for the load, it unloads the least recently used model, with that mapping maintained by this service). I realize that implementing this is not trivial, but it is not a common request we get, since model server is usually used in online settings, where doing a load from source (latency in high seconds) during inference (latency usually in milliseconds) is unacceptable. So we'll have to see more examples of use cases before considering designing and implementing it.
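For concreteness, here is a minimal sketch of such an external coordinator, assuming the tensorflow-serving-api Python package and an eviction budget expressed as a model count. The LruModelRouter class, the MAX_LOADED constant, and the on-disk layout are illustrative, not part of TF Serving:

```python
# Sketch only: an external "model router" that keeps at most MAX_LOADED models
# loaded in TF Serving and evicts the least recently used one by re-sending the
# desired config through the ModelService reload-config RPC.
from collections import OrderedDict

import grpc
from tensorflow_serving.apis import model_management_pb2, model_service_pb2_grpc
from tensorflow_serving.config import model_server_config_pb2

MAX_LOADED = 500  # assumed memory budget, expressed as a number of models


class LruModelRouter:
    def __init__(self, target="localhost:8500", base_path="/models"):
        self._stub = model_service_pb2_grpc.ModelServiceStub(grpc.insecure_channel(target))
        self._base_path = base_path
        self._loaded = OrderedDict()  # model name -> True, ordered from LRU to MRU

    def ensure_loaded(self, name):
        """Call this before forwarding an inference request for `name`."""
        if name in self._loaded:
            self._loaded.move_to_end(name)    # mark as most recently used
            return
        if len(self._loaded) >= MAX_LOADED:
            self._loaded.popitem(last=False)  # evict the least recently used model
        self._loaded[name] = True
        self._push_config()

    def _push_config(self):
        # Model server has no per-model load/unload RPC; we send the full desired
        # config and the server loads/unloads models to converge to it.
        config = model_server_config_pb2.ModelServerConfig()
        for name in self._loaded:
            entry = config.model_config_list.config.add()
            entry.name = name
            entry.base_path = f"{self._base_path}/{name}"
            entry.model_platform = "tensorflow"
        request = model_management_pb2.ReloadConfigRequest(config=config)
        self._stub.HandleReloadConfigRequest(request)
```

A client would call ensure_loaded(model_name) before issuing its Predict request. Eviction here is count-based for simplicity; a production version would track actual memory usage and handle load failures and concurrent requests.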

misterpeddy commented 5 years ago

A note on the "lazily load models": We have a prototype caching_manager that handles the lazy loading of the models at inference time.

What this feature request tracks is the second piece: an eviction mechanism that, upon a load that fails due to insufficient resources, unloads idle models subject to some policy like LRU, ARC, etc.

The reason I included "lazily load models" in the title is that we'd also need to do some plumbing to expose this as an option on the model server - today both model server and server core are strongly tied to using the aspired_version_manager.

guillaumekln commented 5 years ago

I also have some interest in this feature.

My application is machine translation, where some use cases involve many models (i.e. many language pairs) but few users. A user would typically translate a document, which generates many requests to a single model.

To mitigate the loading latency, I know of a PyTorch-based server that copies the model weights from system memory to device memory when a model is needed (and the opposite when unloading). This might not be possible in the context of TensorFlow Serving, but it's good to keep in mind.

misterpeddy commented 5 years ago

@guillaumekln thanks for the comment. Could you share a pointer to this server? I'm curious to understand a couple of things, quoted in the reply below.

guillaumekln commented 5 years ago

This is the server I had in mind: https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/translate/translation_server.py. A configuration file defines which models unload to the system memory (to_cpu) and which models unload completely (unload) when unused.
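For readers unfamiliar with that server, the relevant knobs are set per model in its JSON configuration. The sketch below reproduces them from memory as a Python dict; the field names are assumptions and may not match the current OpenNMT-py format exactly:

```python
# Illustrative only: field names are recalled from OpenNMT-py's translation
# server configuration and may differ from the current format.
example_conf = {
    "models_root": "./available_models",
    "models": [
        {
            "id": 100,
            "model": "model_a.pt",
            "timeout": 600,          # seconds of inactivity before acting
            "on_timeout": "to_cpu",  # move weights back to system memory
            "load": True,            # load at server startup
        },
        {
            "id": 200,
            "model": "model_b.pt",
            "timeout": 600,
            "on_timeout": "unload",  # free the model completely
        },
    ],
}
```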

> whether the delta between system and device memory limits indeed makes a material difference for most use cases (my assumption is no, since AFAIK they're usually the same order of magnitude)

During execution, the device memory holds the model weights plus all of the graph's intermediate buffers and caches. In this example, however, only the model weights are moved to system memory, so the memory pressure on the two sides is quite different.

In machine translation, the model weights are generally below 500MB but during execution the memory usage can go up to 2GB. I suppose it is similar for other tasks.

> whether this pre-loading of model weights helps cases like machine translation, where I'd imagine most of the loading time will be spent on the assets (embeddings)

It helps reduce the initial latency to the point that the user does not notice the model was not fully available when the request arrived. I don't have any numbers, though.

kuien commented 5 years ago

We have similar requirements: we can't load all the models onto the device at the same time, so it would be better to swap models between GPU memory and CPU memory.

My story is: we are developing a medical diagnosis system. During our workflow, many models are involved, organized as a DAG-like pipeline. That is, if a patient is predicted to have one kind of disease, we proceed to classify it into major subdivisions, generate the attributes of the specific subdivision, and generate details of each kind of lesion with various models. So not all models are needed; it depends on the workflow and its branches.

We can't load all the models into memory at the beginning, and models upstream in the workflow are not needed for the later computation. But I don't want to unload/drop those models entirely, because they are useful when processing the next patient's medical images. So could dynamic model swapping between GPU memory and CPU memory be supported in the near future?

misterpeddy commented 5 years ago

Adding a note to this issue that several users at 2019 Oreilly AI conference also requested this feature.

battuzz commented 3 years ago

Hi everyone, I've just run into this same issue recently. We are organized in this way: many different groups in the organization write their own models and publish them to our internal TensorFlow Serving instance. We must ensure that every model can be queried (latency is not a hard requirement for us), but we are at the point where our server's memory usage is near its limit. Some form of model-unloading policy would be really helpful in our situation.

Do you have some plan to implement this feature in the near future?

Thanks

anuragtr commented 3 years ago

This feature definitely makes sense, and the multi-model endpoints feature of AWS SageMaker provides it: https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html. Anything implemented along these lines that provides lazy loading of models, and hence makes it possible to deploy thousands of models on a single, ordinary instance, would be really good. I have already come across a use case like this. However, AWS supports this only for CPU deployments, not GPU.

mKaloer commented 2 years ago

I have created a project that lazily loads/unloads models into TF Serving from disk, blob storage, etc. It may be useful for some of you: https://github.com/mKaloer/TFServingCache