triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

LoRA support #5968

Open TaQuangTu opened 1 year ago

TaQuangTu commented 1 year ago

Is your feature request related to a problem? Please describe. This is a feature request about how to deploy a model with LoRA support.

Describe the solution you'd like I have a UNet model deployed on Triton Inference Server with the TensorRT backend. I also have tens of LoRA weights (in Torch format) to be applied to the UNet model.

For each LoRA weight, I manually clone the UNet model (in Torch format) and merge the LoRA weight into the UNet using the formula W_new = W_unet + BA, where A and B are the weight matrices in the given LoRA weight.
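
A minimal sketch of that merge step in PyTorch, assuming the LoRA checkpoint stores the A/B matrices under `.lora_down.weight` / `.lora_up.weight` keys that mirror the UNet layer names (the key pattern, file names, and `alpha` scale are assumptions; adjust them to your checkpoint format):

```python
import torch

def merge_lora_into_unet(unet_state, lora_state, alpha=1.0):
    # W_new = W_unet + alpha * (B @ A), applied per layer.
    # Assumed key layout: "<layer>.lora_down.weight" holds A,
    # "<layer>.lora_up.weight" holds B, "<layer>.weight" is W_unet.
    merged = dict(unet_state)
    for key, A in lora_state.items():
        if not key.endswith(".lora_down.weight"):
            continue
        B = lora_state[key.replace("lora_down", "lora_up")]
        base_key = key.replace(".lora_down.weight", ".weight")
        if base_key in merged:
            # Clone so the original UNet weights are left untouched.
            # (Conv LoRA layers would need their 4-D kernels flattened
            # before the matmul; this sketch covers linear layers.)
            merged[base_key] = merged[base_key].clone() + alpha * (B @ A)
    return merged

# Usage (file names are placeholders):
# unet_state = torch.load("unet.pth", map_location="cpu")
# lora_state = torch.load("lora_style_a.pth", map_location="cpu")
# torch.save(merge_lora_into_unet(unet_state, lora_state), "unet_merged.pth")
```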

Then I manually convert the merged model to TensorRT and serve it on Triton Inference Server.
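
A rough sketch of that conversion step, assuming the merged weights have been loaded back into a torch.nn.Module; the example inputs, opset, and file names are placeholders:

```python
import torch

def export_merged_unet(unet: torch.nn.Module, example_inputs: tuple,
                       onnx_path: str = "unet_merged.onnx") -> str:
    # Export the merged UNet to ONNX as an intermediate format for TensorRT.
    unet.eval()
    torch.onnx.export(unet, example_inputs, onnx_path, opset_version=17)
    return onnx_path

# Build the engine offline from the ONNX file, e.g. with trtexec:
#   trtexec --onnx=unet_merged.onnx --saveEngine=model.plan --fp16
# and place model.plan under the model's version directory in the
# Triton model repository.
```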

With Triton, I would like an API/function to dynamically modify the internal weights of a running model, in this case the UNet.

Additional context I think this is a well-explained article for getting to know LoRA; hope it helps: https://lightning.ai/pages/community/tutorial/lora-llm/

kthui commented 1 year ago

Thanks for the enhancement suggestion. I have filed a ticket for us to investigate further. DLIS-5053

TaQuangTu commented 1 year ago

@kthui Thank you. It would be great if you could get back to us soon with a rough time estimate for completing it.

kthui commented 1 year ago

cc @Christina-Young-NVIDIA for the time estimate.

rmccorm4 commented 1 year ago

@tanmayv25 I think there was a similar request in the past to support TRT's refit API, but we ultimately went with just reloading the model, right?

I think the same applies here: the model should just be reloaded, and there are APIs for that.
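
For reference, a minimal sketch of such a reload through Triton's model control API via the Python client (the model name is a placeholder, and the server needs to be started with --model-control-mode=explicit to allow explicit load/unload):

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# After the new TensorRT engine has been written into the model's
# repository directory, ask Triton to (re)load the model.
client.load_model("unet_trt")
print(client.is_model_ready("unet_trt"))
```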

tanmayv25 commented 1 year ago

The TRT Refit API is not suitable for production-level serving systems. Making a TensorRT engine refittable comes with a performance cost: the engine cannot be optimized to the same degree when built in refittable mode, which means higher inference latency. Additionally, the engine cannot be used to run inference while the model is being updated with new weights, which can lead to large tail latencies.

Reloading the model with new weights is a better-suited solution for serving. There is no service downtime, as requests can still run on the previous model until the new model with updated weights is ready. And the resulting engine is highly optimized.
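
To illustrate that reload flow (not a documented recipe, just a sketch under assumed paths and names): drop the freshly built engine into a new version directory of the model repository, then trigger the load as in the client snippet above; Triton keeps serving the old version while the new one loads.

```python
import pathlib
import shutil

# Assumed local model repository layout: /models/unet_trt/<version>/model.plan
repo = pathlib.Path("/models/unet_trt")

def publish_new_engine(engine_path: str) -> int:
    # Pick the next free version number and copy the new engine there.
    versions = [int(p.name) for p in repo.iterdir() if p.name.isdigit()]
    next_version = max(versions, default=0) + 1
    version_dir = repo / str(next_version)
    version_dir.mkdir()
    shutil.copy(engine_path, version_dir / "model.plan")
    return next_version

# publish_new_engine("unet_merged_lora_a.plan")
# ...then call client.load_model("unet_trt") as shown above; the previous
# version keeps serving requests until the new one is ready.
```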

foocker commented 11 months ago

So the best way is to convert the merged weights to TRT and then serve them the usual way on Triton?