vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Load LoRA adaptor for Llama3 seems not working #6250

Open ANYMS-A opened 2 months ago

ANYMS-A commented 2 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 11.1.0
Clang version: Could not collect
CMake version: version 3.27.2
Libc version: glibc-2.17

Python version: 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0-7: NVIDIA A100 80GB PCIe
Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 1
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
Stepping: 7
CPU MHz: 1000.048
CPU max MHz: 4000.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47

🐛 Describe the bug

There is no error or warning while my code runs, but I compared the model whose LoRA weights were merged into the original layers via peft.PeftModel.merge_and_unload() against the model that dynamically loads the LoRA adaptor through vLLM's LoRARequest, and their outputs are very different. It seems the LoRA adaptor is not applied when vLLM loads it.

My base model is Llama3-8B-chinese-chat.
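For reference, the merged-weights baseline I compare against looks roughly like this (a minimal sketch; the base-model and adaptor paths are placeholders for my local checkpoints):

    # Baseline for comparison: merge the LoRA weights into the base model with
    # PEFT, then generate with plain transformers. Paths are placeholders.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_path = "path/to/Llama3-8B-chinese-chat"
    adapter_path = "my_local_peft_Lora_adaptor_weights_path"

    tokenizer = AutoTokenizer.from_pretrained(base_path)
    base = AutoModelForCausalLM.from_pretrained(
        base_path, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Fold the LoRA deltas into the base weights, then run a normal generate.
    merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()

    inputs = tokenizer("test prompt", return_tensors="pt").to(merged.device)
    output_ids = merged.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))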

When starting the server, I set the CLI args as:

...
--enable-lora \
--max-loras 4 \
--max-lora-rank 32 \
...
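These flags map onto the corresponding fields of the engine args, so the same configuration can also be set programmatically; a minimal sketch (the model path is a placeholder):

    from vllm.engine.arg_utils import AsyncEngineArgs

    # Programmatic equivalent of the CLI flags above.
    engine_args = AsyncEngineArgs(
        model="path/to/Llama3-8B-chinese-chat",  # placeholder
        enable_lora=True,      # --enable-lora
        max_loras=4,           # --max-loras
        max_lora_rank=32,      # --max-lora-rank
    )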

And the Python code where I initialize the AsyncLLMEngine and the AsyncIterator is shown below:

    # vLLM imports used below; helpers such as prompt_to_input_ids, LOGGER and
    # InputExceedError are defined elsewhere in my code.
    from typing import AsyncIterator

    from vllm import SamplingParams
    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    from vllm.lora.request import LoRARequest
    from vllm.utils import random_uuid

    engine_args = AsyncEngineArgs.from_cli_args(cli_args)
    vllm_engine = AsyncLLMEngine.from_engine_args(engine_args)

    input_ids = prompt_to_input_ids(input_prompt, mode, tokenizer)
    sampling_params = SamplingParams(**generation_kwargs)
    # LoRARequest(lora_name, lora_int_id, lora_local_path); the int id must be
    # a unique positive integer per adaptor.
    lora_request = LoRARequest("lora_adaptor", 1, "my_local_peft_Lora_adaptor_weights_path")
    request_id = random_uuid()

    model_cfg = await vllm_engine.get_model_config()
    max_model_len = model_cfg.max_model_len
    if len(input_ids) > max_model_len:
        msg = f"Input tokens length {len(input_ids)} exceeds 'max_model_len': {max_model_len}"
        LOGGER.error(msg)
        raise InputExceedError(msg)

    async_iterator: AsyncIterator = vllm_engine.generate(
        prompt=None,
        prompt_token_ids=input_ids,
        sampling_params=sampling_params,
        request_id=request_id,
        lora_request=lora_request
    )
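Downstream, the iterator is consumed roughly like this (a simplified sketch; the final RequestOutput carries the generated text):

    # AsyncLLMEngine.generate yields RequestOutput objects as tokens stream in;
    # the last one holds the finished completion.
    final_output = None
    async for request_output in async_iterator:
        final_output = request_output
    generated_text = final_output.outputs[0].text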
fangyuan-ksgk commented 1 month ago

I am getting an even stranger issue, where I attempt to load vLLM + LoRA for Llama3-8B-Instruct:

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("adaptor_iter1", 1, lora_path)
)

And I got error: RuntimeError: Loading lora /root/.cache/huggingface/hub/models--Ksgk-fy--IGR-Adaptor-Meta-Llama-3-8B-Instruct-1/snapshots/37a69f049c3d0233bc77e9c8ac481d909cd6699b failed
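For what it's worth, the vLLM multi-LoRA example first pulls the adaptor into a local directory with snapshot_download and hands that path to LoRARequest; a sketch of that pattern (the repo id is inferred from the cache path in the error, so this may well be what was already attempted):

from huggingface_hub import snapshot_download
from vllm.lora.request import LoRARequest

# Download the adaptor repo locally and point the LoRARequest at that directory.
lora_path = snapshot_download(repo_id="Ksgk-fy/IGR-Adaptor-Meta-Llama-3-8B-Instruct-1")
lora_request = LoRARequest("adaptor_iter1", 1, lora_path)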

kvikk commented 1 month ago

I see the same issue as OP, but with gemma-2b. My LoRA-trained model works as expected when I merge it and use it that way, but when running it as a LoRA adapter as per the vLLM documentation, seemingly the original base model is replying.

python -m vllm.entrypoints.openai.api_server --model ../gemma-2b/ --port 8889 --enable-lora --lora-modules num=../outputs_gemma-2b_delme_num/checkpoint-1200/ --max_lora_rank 64 --max_loras 2

The reply is the same for the original gemma-2b and the num model when I query it.
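For reference, this is roughly how the num model is queried (the adapter is only applied when the request's model field is the registered LoRA name rather than the base model path); a minimal sketch assuming the openai v1 Python client and a placeholder prompt:

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:8889/v1", api_key="EMPTY")

completion = client.completions.create(
    model="num",           # the name registered via --lora-modules num=...
    prompt="Test prompt",  # placeholder
    max_tokens=64,
)
print(completion.choices[0].text)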

ANYMS-A commented 1 month ago

> I see the same issue as OP, but with gemma-2b. My LoRA-trained model works as expected when I merge it and use it that way, but when running it as a LoRA adapter as per the vLLM documentation, seemingly the original base model is replying. python -m vllm.entrypoints.openai.api_server --model ../gemma-2b/ --port 8889 --enable-lora --lora-modules num=../outputs_gemma-2b_delme_num/checkpoint-1200/ --max_lora_rank 64 --max_loras 2 The reply is the same for the original gemma-2b and the num model when I query it.

I am not sure whether it will solve your issue, but you could try reducing max_lora_rank to <= 16, according to this comment: https://github.com/vllm-project/vllm/issues/6333#issuecomment-2241511418

I haven't had a chance to try it myself yet.

kvikk commented 1 month ago

Well, my model is trained with LoRA rank 64, so I have to retrain and try. Edit: I retrained with LoRA rank 16 and set max_lora_rank 16 when running the adapter. Same problem: the merged model outputs a good response on its own, but when used as a LoRA adapter it does not. This is easy to see in my case because I fine-tuned for a distinct output format.
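A quick sanity check on the checkpoint directory is to read the adapter_config.json that PEFT writes, to confirm the rank actually matches what --max-lora-rank is set to and to see which modules the adapter targets; a small sketch (path as in the command above; the retrained checkpoint may live elsewhere):

import json
from pathlib import Path

# PEFT stores the LoRA hyperparameters alongside the weights; "r" must not
# exceed the server's --max-lora-rank for the adapter to be accepted.
cfg = json.loads(Path("../outputs_gemma-2b_delme_num/checkpoint-1200/adapter_config.json").read_text())
print(cfg["r"], cfg["lora_alpha"], cfg["target_modules"])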