vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Add S3/HF Hub dynamic download for LoRA adapters #3501

Open joaopcm1996 opened 7 months ago

joaopcm1996 commented 7 months ago

🚀 The feature, motivation and pitch

Request for dynamic download of LoRA adapters from S3 or the HF Hub, based on the adapter id passed in the request.
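
For illustration only, the desired flow might look like the sketch below. This is hypothetical behavior with a placeholder adapter id, not something the server supports today; as of this issue, adapters have to be registered at server launch.

import requests

# Hypothetical: the "model" field names an adapter by HF Hub repo id (or an S3 URI)
# that the server would download and register on first use, instead of requiring it
# to be pre-registered at launch.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "my-org/my-sql-lora-adapter",  # placeholder adapter id
        "prompt": "SELECT count(*) FROM users",
        "max_tokens": 32,
    },
)
print(response.json())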

Alternatives

No alternatives as of today; adapters need to be downloaded to the server upfront and be available locally.

Additional context

No response

chenqianfzh commented 6 months ago

Can you add a line in your script to download the repo to a local path and run from there?

For instance, you can add lines like the following before running vLLM inference.

from huggingface_hub import snapshot_download

# Download the adapter repo from the HF Hub and return its local path.
lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")
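
For completeness, here is a minimal sketch of how that downloaded path can then be wired into vLLM's offline multi-LoRA API. The base model and prompt are placeholders chosen to match the test adapter; LoRARequest takes a name, a unique integer id, and the local path.

from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Fetch the adapter to local disk first, then point vLLM at that path.
lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
outputs = llm.generate(
    ["Write a SQL query that counts all users:"],
    SamplingParams(max_tokens=64),
    # Adapter name, unique integer id, and the local path downloaded above.
    lora_request=LoRARequest("sql-lora", 1, lora_path),
)
print(outputs[0].outputs[0].text)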

joaopcm1996 commented 6 months ago

Yes, I did this here by downloading all the adapters to disk before launching vLLM. However, because all adapter ids and their corresponding local paths have to be defined statically at launch, no new adapters can be loaded without relaunching the server. It also means the number of adapters that can be served is limited by the server's disk space, since as far as I know there is no eviction from disk at this point. This could be improved so that the same endpoint stays provisioned and new adapters are loaded dynamically from remote object storage.
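
One possible shape for this, as a rough sketch rather than anything vLLM supports today (the cache directory and the resolve_adapter helper are hypothetical), would be a small resolver that maps an adapter id to a local path, downloading from S3 or the HF Hub on first use:

import os

import boto3
from huggingface_hub import snapshot_download

ADAPTER_CACHE = "/tmp/lora-cache"  # hypothetical local cache location

def resolve_adapter(adapter_id: str) -> str:
    """Return a local path for adapter_id, downloading it on first use."""
    local_dir = os.path.join(ADAPTER_CACHE, adapter_id.replace("/", "--"))
    if os.path.isdir(local_dir):
        return local_dir  # already cached on disk
    if adapter_id.startswith("s3://"):
        # Download every object under the S3 prefix into local_dir.
        bucket, _, prefix = adapter_id[len("s3://"):].partition("/")
        s3 = boto3.client("s3")
        for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
            dest = os.path.join(local_dir, os.path.relpath(obj["Key"], prefix))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(bucket, obj["Key"], dest)
        return local_dir
    # Otherwise treat the id as an HF Hub repo id.
    return snapshot_download(repo_id=adapter_id, local_dir=local_dir)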

noamgat commented 6 months ago

I definitely agree with this idea. I am considering using https://github.com/predibase/lorax for this reason, but other than this feature I highly prefer vLLM.

flexchar commented 5 months ago

I would also prefer vLLM over LoRAX, which looks a lot like TGI. Ideally vLLM could have a caching parameter that downloads an adapter and deletes it after some amount of time if it hasn't been used. Of course, this needs some state management.
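
A minimal sketch of what that state management could look like (all names here are hypothetical, and it assumes the resolve_adapter helper sketched earlier in this thread):

import shutil
import time

class AdapterCache:
    """Hypothetical TTL cache: downloads adapters on first use, evicts stale ones from disk."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self.last_used: dict[str, float] = {}  # adapter_id -> last access timestamp
        self.paths: dict[str, str] = {}        # adapter_id -> local path on disk

    def get(self, adapter_id: str) -> str:
        """Return the local path for adapter_id, downloading it on first use."""
        if adapter_id not in self.paths:
            self.paths[adapter_id] = resolve_adapter(adapter_id)  # from the earlier sketch
        self.last_used[adapter_id] = time.time()
        return self.paths[adapter_id]

    def evict_stale(self) -> None:
        """Delete adapters from disk that have not been used within the TTL."""
        now = time.time()
        for adapter_id, ts in list(self.last_used.items()):
            if now - ts > self.ttl:
                shutil.rmtree(self.paths.pop(adapter_id), ignore_errors=True)
                del self.last_used[adapter_id]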