vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Multi lora on multi gpus #6133

Closed jiuzhangsy closed 1 month ago

jiuzhangsy commented 3 months ago

🚀 The feature, motivation and pitch

I need to run inference with vLLM across multiple GPUs and manage multiple LoRA adapters. Can anyone help? Thanks very much.

Alternatives

No response

Additional context

No response

jeejeelee commented 3 months ago

This example is a great starting point, and you can set tensor_parallel_size for multi-GPU inference. For example:

    import vllm

    # MODEL_PATH points to the base model; LoRA adapters are supplied per request.
    llm = vllm.LLM(
        MODEL_PATH,
        enable_lora=True,
        max_num_seqs=16,
        max_loras=2,                 # number of LoRA adapters resident on the GPUs at once
        trust_remote_code=True,
        gpu_memory_utilization=0.3,
        tensor_parallel_size=4,      # shard the model across 4 GPUs
    )
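
To route requests to specific adapters at generation time, you can pass a LoRARequest to generate. Here is a minimal sketch; the adapter names and local paths are placeholders, and it assumes the adapters have already been downloaded to disk:

    from vllm import SamplingParams
    from vllm.lora.request import LoRARequest

    sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

    # Each LoRARequest carries an adapter name, a unique integer ID,
    # and the local path to the adapter weights (placeholder paths below).
    lora_a = LoRARequest("adapter_a", 1, "/path/to/lora_a")
    lora_b = LoRARequest("adapter_b", 2, "/path/to/lora_b")

    # The same engine serves requests for different adapters; up to
    # max_loras adapters are kept in GPU memory at a time.
    out_a = llm.generate(["Hello from adapter A"], sampling_params, lora_request=lora_a)
    out_b = llm.generate(["Hello from adapter B"], sampling_params, lora_request=lora_b)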