ANYMS-A opened this issue 4 months ago
I am getting an even stranger issue, where I attempt to load vLLM + LoRA for Llama3-8B-Instruct:
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("adaptor_iter1", 1, lora_path),
)
And I got the error: RuntimeError: Loading lora /root/.cache/huggingface/hub/models--Ksgk-fy--IGR-Adaptor-Meta-Llama-3-8B-Instruct-1/snapshots/37a69f049c3d0233bc77e9c8ac481d909cd6699b failed
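For reference, a minimal sketch of the offline flow this snippet is part of, assuming the base model is loaded with enable_lora=True and a max_lora_rank that covers the adapter's rank; the model name, prompts, and lora_path below are placeholders, not actual values from this report:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholders: swap in the real base model and a local adapter directory
# containing adapter_config.json plus the adapter weights.
lora_path = "/path/to/lora_adapter"
prompts = ["Hello, how are you?"]

# enable_lora must be set on the engine, and max_lora_rank must be at least
# the adapter's rank, otherwise the adapter will fail to load.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,
    max_lora_rank=64,
)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# LoRARequest takes a name, a unique integer id, and the local adapter path.
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("adaptor_iter1", 1, lora_path),
)
print(outputs[0].outputs[0].text)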
I see the same issue as OP, but with gemma-2b. My LoRA-trained model works as expected when I merge it and use it that way, but when I run it as a LoRA adapter per the vLLM documentation, the original base model seems to be replying.
python -m vllm.entrypoints.openai.api_server --model ../gemma-2b/ --port 8889 --enable-lora --lora-modules num=../outputs_gemma-2b_delme_num/checkpoint-1200/ --max_lora_rank 64 --max_loras 2
The reply is the same for the original gemma-2b and the num model when I query it.
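Note: with the OpenAI-compatible server, the adapter is only applied when the request's model field is the LoRA module name (num in the command above); a request that names the base model returns base-model output. A minimal sketch with the openai client, assuming the server started above and a placeholder prompt:

from openai import OpenAI

# The server above listens on port 8889 and does not require a real API key.
client = OpenAI(base_url="http://localhost:8889/v1", api_key="EMPTY")

# model must be the LoRA module name ("num"), not the base model path,
# for the adapter weights to be used.
resp = client.completions.create(
    model="num",
    prompt="Write forty-two in the trained output format:",  # placeholder prompt
    max_tokens=64,
)
print(resp.choices[0].text)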
I am not sure whether it will solve your issue, but you could try reducing max_lora_rank to <= 16, per this comment: https://github.com/vllm-project/vllm/issues/6333#issuecomment-2241511418
I haven't had a chance to try it myself.
Well, my model was trained with LoRA rank 64, so I would have to retrain and try. Edit: I retrained with LoRA rank 16 and set max_lora_rank 16 when running the adapter. Same problem: the merged model produces a good response on its own, but used as a LoRA adapter it does not. It is easy to see in my case because I fine-tuned for a distinct output format.
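One way to rule out a rank or config mismatch is to inspect the adapter's adapter_config.json directly; a quick sketch, assuming the checkpoint path from the command above:

import json
from pathlib import Path

adapter_dir = Path("../outputs_gemma-2b_delme_num/checkpoint-1200")
cfg = json.loads((adapter_dir / "adapter_config.json").read_text())

# "r" must not exceed the --max_lora_rank passed to vLLM. Entries in
# modules_to_save (e.g. embed_tokens or lm_head) are merged by PEFT's
# merge_and_unload() but may not be applied when the adapter is loaded
# dynamically, which can make it look like the adapter has no effect.
print("rank:", cfg.get("r"))
print("alpha:", cfg.get("lora_alpha"))
print("target_modules:", cfg.get("target_modules"))
print("modules_to_save:", cfg.get("modules_to_save"))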
I appear to be encountering the same issue with vLLM versions 0.6.2 and 0.6.3.post1.
Your current environment
Collecting environment information.
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 11.1.0
Clang version: Could not collect
CMake version: version 3.27.2
Libc version: glibc-2.17

Python version: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe
GPU 4: NVIDIA A100 80GB PCIe
GPU 5: NVIDIA A100 80GB PCIe
GPU 6: NVIDIA A100 80GB PCIe
GPU 7: NVIDIA A100 80GB PCIe

Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    1
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
Stepping:              7
CPU MHz:               1000.048
CPU max MHz:           4000.0000
CPU min MHz:           1000.0000
BogoMIPS:              4800.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-23
NUMA node1 CPU(s):     24-47
🐛 Describe the bug
There is no error or warning while my code runs, but I compared the model whose LoRA weights were merged into the original layers via peft.PeftModel.merge_and_unload() with the model that dynamically loads the LoRA adapter through vLLM's LoRARequest. Their outputs are very different; it seems the LoRA adapter has no effect when vLLM loads it. My base model is Llama3-8B-Chinese-Chat.
When starting the server, I set the CLI args as:
And the Python code where I initialize the AsyncLLMEngine and the async iterator is shown below:
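For reference, a minimal sketch of the AsyncLLMEngine + LoRARequest flow described above; the paths and names are placeholders rather than the values from this report, and generate()'s exact signature varies slightly across vLLM versions:

import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.lora.request import LoRARequest

# Placeholder paths; enable_lora and max_lora_rank must cover the adapter.
engine_args = AsyncEngineArgs(
    model="/path/to/Llama3-8B-Chinese-Chat",
    enable_lora=True,
    max_lora_rank=64,
    max_loras=1,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

async def run(prompt: str) -> None:
    sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
    # The lora_request must be passed on every generate() call; without it
    # the base model answers.
    results = engine.generate(
        prompt,
        sampling_params,
        request_id="req-0",
        lora_request=LoRARequest("my_adapter", 1, "/path/to/lora_adapter"),
    )
    final = None
    async for request_output in results:
        final = request_output  # the async iterator yields incremental outputs
    if final is not None:
        print(final.outputs[0].text)

asyncio.run(run("Hello"))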