vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Failing to find LoRA adapter for MultiLoRA Inference #4520

Open · RonanKMcGovern opened this issue 4 months ago

RonanKMcGovern commented 4 months ago

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

I'm running the latest Docker image with an OpenAI-style endpoint.

My command is:

--model NousResearch/Meta-Llama-3-8B-Instruct --max-model-len 8192 --port 8000 --enable-lora --lora-modules forced-french=Trelis/Meta-Llama-3-8B-Instruct-forced-french-adapters --max-loras 1 --max-lora-rank 8

I'm hitting the endpoint (on RunPod) with:

curl https://y55xy7ozoxrn15-8000.proxy.runpod.net/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "forced-french",
        "prompt": "Why did the chicken cross the road?",
        "max_tokens": 50,
        "temperature": 0
    }'

The error is:

terminal: Internal Server Error

logs:

2024-05-01T08:45:12.669025667Z ERROR 05-01 08:45:12 async_llm_engine.py:43]     return func(*args, **kwargs)
2024-05-01T08:45:12.669029441Z ERROR 05-01 08:45:12 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in execute_model
2024-05-01T08:45:12.669035014Z ERROR 05-01 08:45:12 async_llm_engine.py:43]     output = self.model_runner.execute_model(seq_group_metadata_list,
2024-05-01T08:45:12.669038735Z ERROR 05-01 08:45:12 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-05-01T08:45:12.669042411Z ERROR 05-01 08:45:12 async_llm_engine.py:43]     return func(*args, **kwargs)
2024-05-01T08:45:12.669045915Z ERROR 05-01 08:45:12 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 830, in execute_model
2024-05-01T08:45:12.669049531Z ERROR 05-01 08:45:12 async_llm_engine.py:43]     self.set_active_loras(lora_requests, lora_mapping)
2024-05-01T08:45:12.669053065Z ERROR 05-01 08:45:12 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 940, in set_active_loras
2024-05-01T08:45:12.669056691Z ERROR 05-01 08:45:12 async_llm_engine.py:43]     self.lora_manager.set_active_loras(lora_requests, lora_mapping)
2024-05-01T08:45:12.669060878Z ERROR 05-01 08:45:12 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 112, in set_active_loras
2024-05-01T08:45:12.669064591Z ERROR 05-01 08:45:12 async_llm_engine.py:43]     self._apply_loras(lora_requests)
2024-05-01T08:45:12.669068158Z ERROR 05-01 08:45:12 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 234, in _apply_loras
2024-05-01T08:45:12.669071818Z ERROR 05-01 08:45:12 async_llm_engine.py:43]     self.add_lora(lora)
2024-05-01T08:45:12.669075342Z ERROR 05-01 08:45:12 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 241, in add_lora
2024-05-01T08:45:12.669078878Z ERROR 05-01 08:45:12 async_llm_engine.py:43]     lora = self._load_lora(lora_request)
2024-05-01T08:45:12.669082375Z ERROR 05-01 08:45:12 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 161, in _load_lora
2024-05-01T08:45:12.669085905Z ERROR 05-01 08:45:12 async_llm_engine.py:43]     raise RuntimeError(
2024-05-01T08:45:12.669089460Z ERROR 05-01 08:45:12 async_llm_engine.py:43] RuntimeError: Loading lora Trelis/Meta-Llama-3-8B-Instruct-forced-french-adapters failed
2024-05-01T08:45:12.670086449Z Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7fdba4287400>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fdbaf021510>>)
2024-05-01T08:45:12.670134866Z handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7fdba4287400>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7fdbaf021510>>)>
2024-05-01T08:45:12.670143323Z Traceback (most recent call last):
2024-05-01T08:45:12.670151190Z   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 149, in _load_lora
2024-05-01T08:45:12.670157520Z     lora = self._lora_model_cls.from_local_checkpoint(
2024-05-01T08:45:12.670163614Z   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 210, in from_local_checkpoint
2024-05-01T08:45:12.670170560Z     with open(lora_config_path) as f:
2024-05-01T08:45:12.670176984Z FileNotFoundError: [Errno 2] No such file or directory: 'Trelis/Meta-Llama-3-8B-Instruct-forced-french-adapters/adapter_config.json'
2024-05-01T08:45:12.675367370Z     output = await make_async(self.driver_worker.execute_model)(
2024-05-01T08:45:12.675368878Z   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
2024-05-01T08:45:12.675370415Z     result = self.fn(*self.args, **self.kwargs)
2024-05-01T08:45:12.675371860Z   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-05-01T08:45:12.675373398Z     return func(*args, **kwargs)
2024-05-01T08:45:12.675375002Z   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 249, in execute_model
2024-05-01T08:45:12.675376462Z     output = self.model_runner.execute_model(seq_group_metadata_list,
2024-05-01T08:45:12.675377925Z   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
2024-05-01T08:45:12.675379612Z     return func(*args, **kwargs)
2024-05-01T08:45:12.675381100Z   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 830, in execute_model
2024-05-01T08:45:12.675382617Z     self.set_active_loras(lora_requests, lora_mapping)
2024-05-01T08:45:12.675384082Z   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 940, in set_active_loras
2024-05-01T08:45:12.675385562Z     self.lora_manager.set_active_loras(lora_requests, lora_mapping)
2024-05-01T08:45:12.675387366Z   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 112, in set_active_loras
2024-05-01T08:45:12.675388845Z     self._apply_loras(lora_requests)
2024-05-01T08:45:12.675390312Z   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 234, in _apply_loras
2024-05-01T08:45:12.675391802Z     self.add_lora(lora)
2024-05-01T08:45:12.675393260Z   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 241, in add_lora
2024-05-01T08:45:12.675394687Z     lora = self._load_lora(lora_request)
2024-05-01T08:45:12.675396124Z   File "/usr/local/lib/python3.10/dist-packages/vllm/lora/worker_manager.py", line 161, in _load_lora
2024-05-01T08:45:12.675397568Z     raise RuntimeError(
2024-05-01T08:45:12.675399164Z RuntimeError: Loading lora Trelis/Meta-Llama-3-8B-Instruct-forced-french-adapters failed

Note that the model does display correctly when I hit the /models endpoint:

curl https://y55xy7ozoxrn15-8000.proxy.runpod.net/v1/models       
{"object":"list","data":[{"id":"NousResearch/Meta-Llama-3-8B-Instruct","object":"model","created":1714553209,"owned_by":"vllm","root":"NousResearch/Meta-Llama-3-8B-Instruct","parent":null,"permission":[{"id":"modelperm-9630fa8b3fd84626ab68bdd5da94c8a2","object":"model_permission","created":1714553209,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]},{"id":"forced-french","object":"model","created":1714553209,"owned_by":"vllm","root":"NousResearch/Meta-Llama-3-8B-Instruct","parent":null,"permission":[{"id":"modelperm-31e64ce42d8b460c9551118150aa27c1","object":"model_permission","created":1714553209,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

Further notes:

  1. My adapter is rank 4 (I notice that max rank can only be 8, 16, or 32 - I'm unsure whether this means a rank of 4 is unsupported).
  2. My adapter's safetensors file contains embedding weights (the embeddings were also updated during training). Could this be the cause of the issue? It would make sense if only LoRAs on linear layers are supported.
  3. My LoRA is only applied to one layer. Is this potentially the issue? (A quick way to check these fields is sketched below.)
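A quick way to double-check points 1-3 is to read the adapter's adapter_config.json; the rank, the target modules, and any fully saved modules (such as embeddings) are all listed there. A minimal sketch, using the hypothetical local download directory from above:

# pretty-print the adapter config (path is the hypothetical download directory above)
python3 -m json.tool /adapters/forced-french/adapter_config.json
# fields to check:
#   "r"               -> note 1: the adapter rank (should read 4)
#   "target_modules"  -> note 3: which layers the LoRA is applied to
#   "modules_to_save" -> note 2: fully trained modules such as embeddings, if present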
RonanKMcGovern commented 3 months ago

Can someone assist with this bug? Thanks

LattaruoloAndrea commented 3 months ago

Hello, I have the same issue, but when loading the adapter from a local directory:

docker run  -v  /.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=..." -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model meta-llama/Llama-2-7b-hf --enable-lora --lora-modules sql-my-test-lora=./FINE_TUNED_MODELS_TEXT_TO_SQL/checkpoint-500/

And I get the following error:

    lora = self._lora_model_cls.from_local_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/vllm/lora/models.py", line 214, in from_local_checkpoint
    with open(lora_config_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: './FINE_TUNED_MODELS_TEXT_TO_SQL/checkpoint-500/adapter_config.json'

but adapter_config.json is present inside that directory (note that the adapter also shows up on the /models endpoint):

curl http://localhost:8892/v1/models  
{"object":"list","data":[{"id":"meta-llama/Llama-2-7b-hf","object":"model","created":1716542398,"owned_by":"vllm","root":"meta-llama/Llama-2-7b-hf","parent":null,"permission":[{"id":"modelperm-9dcdb84d92de4bc7ba960fa844213025","object":"model_permission","created":1716542398,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]},{"id":"sql-my-test-lora","object":"model","created":1716542398,"owned_by":"vllm","root":"meta-llama/Llama-2-7b-hf","parent":null,"permission":[{"id":"modelperm-403c7eca17474cf795a88605cd75fa47","object":"model_permission","created":1716542398,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

@RonanKMcGovern did you try to load adapters from a local directory?

Thanks