vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Can S-Lora be integrated into vLLM? #1610

Closed nivibilla closed 6 months ago

nivibilla commented 10 months ago

https://github.com/S-LoRA/S-LoRA

They call the technique 'unified paging', and it doesn't require the LoRA weights to be merged into the base model before inference. This would be really useful for serving Mixture-of-Experts models, for example, or for a service that needs many different fine-tuned LoRA adapters on top of the same base model.
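A rough sketch of what "no merging" means, with illustrative names and shapes (not actual S-LoRA code): the base weight stays shared, and each request only brings its own small low-rank pair, applied on the fly.

```python
import torch

def lora_linear(x, W, A, B, scaling=1.0):
    # Base projection plus the low-rank update, computed per request
    # instead of merging (W + scaling * A @ B) into one matrix.
    return x @ W + scaling * ((x @ A) @ B)

hidden, rank, out = 4096, 16, 4096
x = torch.randn(2, hidden)    # two tokens
W = torch.randn(hidden, out)  # shared base weight
A = torch.randn(hidden, rank) # per-adapter low-rank factors
B = torch.randn(rank, out)
y = lora_linear(x, W, A, B)
```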

And needless to say, there have been a lot of requests for LoRA deployment support.

matankley commented 10 months ago

+1

farouqaldori commented 9 months ago

+1

BharatSingla12 commented 9 months ago

+1

GayHub1010 commented 9 months ago

+1

AmoghM commented 9 months ago

+1

DavidPeleg6 commented 9 months ago

+1

jpeig commented 9 months ago

+2

draplater commented 9 months ago

+1

wDevil commented 9 months ago

+1

dongzhiwen1218 commented 9 months ago

+9

yuiant commented 9 months ago

+10086

nmhjklnm commented 9 months ago

+23333

Ted8000 commented 9 months ago

+120

nivibilla commented 9 months ago

This could possibly be the solution: #1804

hahazei commented 7 months ago

+1

simon-mo commented 6 months ago

The integration is complete. https://docs.vllm.ai/en/latest/models/lora.html
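For reference, a minimal offline sketch along the lines of the linked docs; the base model name and adapter path below are placeholders.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the base model once with LoRA support enabled.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(temperature=0, max_tokens=64)

# Each generate call can reference a different adapter by (name, id, path);
# the base weights are shared across all adapters.
outputs = llm.generate(
    ["Write a SQL query that lists all users."],
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql_lora"),
)
print(outputs[0].outputs[0].text)
```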

ashutoshbhushan21 commented 5 months ago

Does this handle multiple LoRAs simultaneously, i.e. concurrent calls to different LoRA adapters, like S-LoRA does?

simon-mo commented 5 months ago

It is exactly S-LoRA, with Punica batched GEMM kernels. This means concurrent calls to different LoRA adapters can be executed in one batch.
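A conceptual sketch of that batching (illustrative only; the real path uses fused Punica kernels on the GPU rather than a Python loop): tokens from requests using different adapters sit in the same batch, and each token's low-rank update uses its own adapter's factors.

```python
import torch

def batched_lora(x, W, As, Bs, adapter_ids):
    # x:           [tokens, hidden]
    # W:           [hidden, out]              shared base weight
    # As:          [num_adapters, hidden, r]  per-adapter factors
    # Bs:          [num_adapters, r, out]
    # adapter_ids: [tokens]                   which adapter each token uses
    y = x @ W
    for i in torch.unique(adapter_ids):
        mask = adapter_ids == i
        y[mask] += (x[mask] @ As[i]) @ Bs[i]
    return y

tokens, hidden, r, out, n_adapters = 8, 64, 4, 64, 3
x = torch.randn(tokens, hidden)
W = torch.randn(hidden, out)
As = torch.randn(n_adapters, hidden, r)
Bs = torch.randn(n_adapters, r, out)
ids = torch.randint(0, n_adapters, (tokens,))
print(batched_lora(x, W, As, Bs, ids).shape)  # torch.Size([8, 64])
```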