vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How to disable multi-LoRA to avoid using Punica? Or is Punica the only choice? #4434

laoda513 opened this issue 4 months ago

laoda513 commented 4 months ago

I searched the older issues, and everyone says that multi-LoRA's Punica kernels require a GPU with compute capability >= 8.0. So I want to ask: is there an option that uses only a single standalone LoRA but still works on compute capability 7.5?

I tried examples/offline_inference.py, calling llm.generate with only one LoRA, but it still went through Punica and then reported that compute capability >= 8.0 is required.
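
For reference, this is roughly the single-adapter call being described; the model name and adapter path below are just placeholders. As noted above, setting enable_lora=True routes generation through the LoRA (Punica) kernels even when only one adapter is used:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora=True activates the LoRA (Punica) code path, even with a single adapter
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

outputs = llm.generate(
    ["Give me a short introduction to LoRA."],
    SamplingParams(temperature=0.0, max_tokens=64),
    # placeholder adapter: (name, integer id, local path)
    lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```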

robertgshaw2-neuralmagic commented 4 months ago

If you only have one LoRA adapter, simply merge the adapter back into your model and you can use it directly.
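
For reference, a minimal sketch of that merge step using PEFT's merge_and_unload; the base model name, adapter path, and output directory are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# load the base model and attach the (single) LoRA adapter
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "/path/to/lora_adapter")

# fold the LoRA deltas into the base weights and save a plain checkpoint
merged = model.merge_and_unload()
merged.save_pretrained("/path/to/merged_model")

# keep the tokenizer alongside the merged weights
AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf").save_pretrained("/path/to/merged_model")
```

The merged directory can then be served by vLLM as an ordinary model, without enable_lora, so the Punica kernels are never needed.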

laoda513 commented 4 months ago

Thanks!

Hmm, although that sounds complicated if I want to keep multiple LoRAs... each one would need its own merged full-model copy, which would take too much disk space.

yyccli commented 4 months ago

If your model is not fine-tuned in bfloat16, then you can just compile the float16 kernels, and the float16 kernels support sm >= 75.

laoda513 commented 4 months ago

If your model is not fine-tuned in bfloat16, then you can just compile the float16 kernels, and the float16 kernels support sm >= 75.

Thanks! May I ask how to compile the float16 kernels? I did not find it in the docs.

yyccli commented 4 months ago

  1. You need to comment out some operations related to bf16 in vec_dtypes.cuh and punica_ops.cc.
  2. Modify the CMakeLists.txt file to allow the sm75 flag.

laoda513 commented 4 months ago

  1. You need to comment out some operations related to bf16 in vec_dtypes.cuh and punica_ops.cc.

  2. Modify the CMakeLists.txt file to allow the sm75 flag.

OK, thank you! Sounds like quite a challenge for me, but I will give it a try.