[FastGen] Hot-swappable LoRA adapters?

microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Apache License 2.0

1.85k stars 175 forks

[FastGen] Hot-swappable LoRA adapters? #271

Open corbt opened 10 months ago

corbt commented 10 months ago

Hey there! FastGen seems really awesome. I'm curious whether the roadmap includes support for serving models with LoRA adapters? Our use case is that we have hundreds of different LoRAs we need to serve, and keeping the fully merged models live on GPUs at all times isn't feasible. It would be awesome if FastGen implemented something like S-LoRA so we can serve requests from multiple LoRAs simultaneously!
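To make the request concrete, here is a minimal pure-Python sketch of the general idea: keep one shared base weight `W` and apply each request's low-rank pair `(A, B)` unmerged, as `y = xW + (xA)B`. Every name here (`lora_forward`, `serve_batch`, the `adapters` dict) is hypothetical and chosen for illustration; this is not FastGen or S-LoRA API, and a real implementation would use batched GPU kernels rather than Python lists.

```python
# Illustrative sketch only: several LoRA adapters sharing one base weight,
# applied per request without merging. All names are hypothetical.

def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def lora_forward(x, W, adapter, scale=1.0):
    """Compute y = x W + scale * (x A) B, keeping A and B separate from W."""
    base = matvec(x, W)
    if adapter is None:  # request that uses the plain base model
        return base
    A, B = adapter       # A is d_in x r, B is r x d_out
    delta = matvec(matvec(x, A), B)
    return [b + scale * d for b, d in zip(base, delta)]

def serve_batch(requests, W, adapters):
    """requests is a list of (adapter_id, x); W is shared by every request."""
    return [lora_forward(x, W, adapters.get(adapter_id)) for adapter_id, x in requests]

W = [[1.0, 0.0], [0.0, 1.0]]  # shared base weight (2x2 identity)
adapters = {
    "a": ([[1.0], [0.0]], [[0.0, 1.0]]),  # rank-1 adapter: A is 2x1, B is 1x2
    "b": ([[0.0], [1.0]], [[2.0, 0.0]]),
}
batch = [("a", [1.0, 2.0]), ("b", [1.0, 2.0]), (None, [1.0, 2.0])]
outputs = serve_batch(batch, W, adapters)
print(outputs)
```

The point of the sketch is that only the small `(A, B)` pairs differ per request, so hundreds of adapters could in principle share a single copy of the base weights instead of hundreds of merged models.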

cmikeh2 commented 10 months ago

Thanks for the suggestion! I don't have a concrete timeline for something like this yet, but I do think this is a great feature for us to support moving forward, and I will work to establish a roadmap to integrate it.