pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Automatic loading and unloading of model. #1769

Open abhinav-cashify opened 2 years ago

abhinav-cashify commented 2 years ago

🚀 The feature

TorchServe should automatically load and unload models based on request traffic. For example, if I have registered 3 models in TorchServe and one of them does not receive any requests for a day, it is automatically unloaded from memory. As soon as that model receives a request again, it is loaded back into memory (like the behavior provided by an AWS SageMaker multi-model endpoint).
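The requested behavior amounts to a simple idle-eviction policy: track the last request time per model and unload any model that has been idle longer than a TTL. A minimal sketch of that decision logic (the model names and the 1-day TTL are illustrative, not part of TorchServe):

```python
import time

# TTL after which an idle model should be unloaded (1 day, as in the example above)
IDLE_TTL_SECONDS = 24 * 60 * 60

def models_to_unload(last_hit, now, ttl=IDLE_TTL_SECONDS):
    """Return the models whose most recent request is older than the TTL.

    last_hit maps model name -> unix timestamp of its most recent request.
    """
    return sorted(name for name, ts in last_hit.items() if now - ts > ttl)

# Example: model_b has not been hit for two days, so it is selected for eviction.
now = time.time()
last_hit = {
    "model_a": now - 60,         # hit a minute ago -> keep loaded
    "model_b": now - 2 * 86400,  # idle for two days -> unload
    "model_c": now - 3600,       # hit an hour ago -> keep loaded
}
print(models_to_unload(last_hit, now))  # -> ['model_b']
```

A serving frontend would run this check periodically and scale the selected models' workers to zero, then restore a worker on the next incoming request.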

Motivation, pitch

Currently, we have to use the management API to set the number of workers before we can make inferences on a model. If a model is not going to be used for some time, I have to manually set its workers to 0; otherwise it keeps consuming resources even while idle. I would like to set all my models to 0 initial workers, and whenever I send an inference request to one, have it loaded with 1 worker.
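The manual scale-to-zero step described above goes through TorchServe's management API (port 8081 by default), which scales workers with `PUT /models/{model_name}?min_worker=N`. A hedged sketch, assuming a local TorchServe instance and a hypothetical model named `my_model`:

```python
import urllib.request

MANAGEMENT = "http://localhost:8081"  # default TorchServe management address

def scale_url(model_name, min_worker, base=MANAGEMENT):
    """Build the management-API URL that sets a model's minimum worker count."""
    return f"{base}/models/{model_name}?min_worker={min_worker}"

def set_workers(model_name, min_worker):
    """Issue the PUT request (requires a running TorchServe instance)."""
    req = urllib.request.Request(scale_url(model_name, min_worker), method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Park an unused model (0 workers), then bring it back with one worker:
#   set_workers("my_model", 0)
#   set_workers("my_model", 1)
print(scale_url("my_model", 0))
# -> http://localhost:8081/models/my_model?min_worker=0
```

The feature request is essentially for TorchServe to issue the equivalent of these two calls itself, driven by traffic, instead of leaving them to the operator.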

Alternatives

No response

Additional context

No response

amit-cashify commented 2 years ago

@msaroufim

lxning commented 2 years ago

@amit-cashify @abhinav-cashify The AWS SageMaker multi-model endpoint makes calls to TorchServe to unload models based on memory usage. This elastic loading/unloading is provided by the SageMaker hosting service, and customers pay for it with an inference-latency spike whenever a model has to be reloaded.

On the TorchServe roadmap, we are going to address memory usage and elastic parallel processing by providing the following features:


Please let us know if you have any questions.

amit-cashify commented 1 year ago

Any update on this?

otakbeku commented 1 year ago

Any update? I have pretty much the same problem.

EdwardYGLi commented 2 months ago

I'm running into a similar issue: the underlying .mar file for the same model has been updated. Instead of redeploying the endpoint, how can I reload the .mar that is currently already loaded on the instance?
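TorchServe does not watch a .mar file for changes, but the management API can swap the model in place: unregister the loaded version (`DELETE /models/{name}/{version}`), then register the updated archive (`POST /models?url=...`). A sketch, assuming a hypothetical model `my_model`, version `1.0`, and an updated archive at a URL reachable by the instance:

```python
import urllib.parse
import urllib.request

MANAGEMENT = "http://localhost:8081"  # default TorchServe management address

def unregister_url(model_name, version, base=MANAGEMENT):
    """DELETE this URL to unload a registered model version."""
    return f"{base}/models/{model_name}/{version}"

def register_url(mar_url, initial_workers=1, base=MANAGEMENT):
    """POST this URL to register (and load) a model archive."""
    query = urllib.parse.urlencode({"url": mar_url, "initial_workers": initial_workers})
    return f"{base}/models?{query}"

def reload_model(model_name, version, mar_url):
    """Unregister the old version, then register the updated .mar
    (requires a running TorchServe instance)."""
    req = urllib.request.Request(unregister_url(model_name, version), method="DELETE")
    urllib.request.urlopen(req)
    req = urllib.request.Request(register_url(mar_url), method="POST")
    urllib.request.urlopen(req)

print(unregister_url("my_model", "1.0"))
# -> http://localhost:8081/models/my_model/1.0
```

Note that between the two calls the model is briefly unavailable; a zero-downtime swap would register the new archive under a new version first and then unregister the old one.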