ashgold opened this issue 3 weeks ago
@ashgold this is a deadlock and is expected given the way the distributed task scheduling is currently implemented.
I would not classify it as a bug since you've modified code internal to vLLM. You can't call these lora methods from within the model execution loop, which is what is happening here. I'd suggest just calling list_loras() as needed from a separate external thread.
@njhill I tried isolating the list_loras() call so that it is only invoked via FastAPI's BackgroundTasks, but the result was the same (see the sketch below). If calling distributed methods like add_loras and remove_loras from a separate thread causes a deadlock, I think this is a potential bug.
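Roughly what I tried looked like this (a minimal sketch; the route path is made up for illustration, and `engine` is a placeholder for the already-constructed engine instance that exposes list_loras()):

```python
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()
engine = ...  # placeholder: the engine instance created at server startup

def log_loras() -> None:
    # Blocking call into the engine; this is what ends up hanging
    # when tensor parallelism is enabled.
    print(engine.list_loras())

@app.get("/debug/loras")  # hypothetical route, for illustration only
async def debug_loras(background_tasks: BackgroundTasks) -> dict:
    # Defer the call to a background task instead of making it directly
    # inside the request handler; the result was still a hang.
    background_tasks.add_task(log_loras)
    return {"status": "scheduled"}
```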
@ashgold the trouble is that in distributed mode, these lora methods are currently essentially blocking functions, so they shouldn't be called directly from an async context. It sounds like that's what you're trying here, and if so it's not actually a separate thread.
It's not immediately obvious to me that list_loras in particular needs to call out to all workers; I'll look into that.
In the meantime, you could try just calling these methods via run_in_executor.
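For reference, a minimal sketch of the run_in_executor approach (assuming an `engine` object that exposes the blocking list_loras() method):

```python
import asyncio

async def fetch_loras(engine):
    # Run the blocking list_loras() call on the default thread-pool executor
    # so the asyncio event loop driving the server is never blocked by it.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, engine.list_loras)
```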
We have a similar timeout error here.
Your current environment
🐛 Describe the bug
In vLLM v0.4.3 and later, calling list_loras() under tensor parallelism causes the system to hang.
Using vLLM v0.4.3 as a base, I modified the code to find out which multi-LoRA adapters are currently loaded on the CPU/GPU.
As shown below, I simply added a call to self.list_loras() inside the do_log_stats() method of vllm/engine/llm_engine.py.
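The change was roughly the following (a sketch of my local edit, not the upstream code; the existing body of do_log_stats() is abbreviated, and `logger` refers to the module-level logger already present in that file):

```python
# vllm/engine/llm_engine.py (local modification, abbreviated)
class LLMEngine:
    def do_log_stats(self, *args, **kwargs) -> None:
        ...  # existing stats-logging logic left unchanged
        # Added line: log which LoRA adapters are currently loaded.
        # Under tensor parallelism this call hangs once inference starts.
        logger.info("Loaded LoRA adapters: %s", self.list_loras())
```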
I ran the framework through the OpenAI entrypoint, and the do_log_stats() method works fine as long as no LLM inference is running. However, the moment I call the /v1/completions API, it gets stuck in the list_loras() method, and I never get a response from the /v1/completions API. After 30 minutes in this state, the following error message is returned.
If I add --disable-log-stats to the launch arguments, do_log_stats() is not called, so the /v1/completions API responds normally.
In v0.4.2, the list_loras() method worked correctly, but since v0.4.3 the following scheduling improvements have been made, and this change seems to be the problem.
I'm also curious as to why the above PR causes an issue with the list_loras() call.