Closed: markovalexander closed this issue 3 months ago.
You may want to refer to this troubleshooting guide.
I think you may be able to fix this by changing the following line to `device=self.device`:
Be aware, though, that there is some instability. I was hitting crashes in the custom all-reduce CUDA kernels when trying to use LoRAs of rank 2, but it worked fine with a LoRA of rank 8.
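Roughly, the change looks like this (a minimal sketch with a hypothetical helper, not the actual vLLM source, which varies by version): allocate the buffer on the worker's own device instead of defaulting to CPU.

```python
import torch

# Hypothetical helper illustrating the suggested change; the real code
# lives in vLLM's LoRA loading path and differs by version.
def alloc_lora_buffer(shape, dtype, device):
    # before (illustrative): torch.zeros(shape, dtype=dtype, device="cpu")
    return torch.zeros(shape, dtype=dtype, device=device)  # i.e. device=self.device
```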
Another option is to disable memory pinning here:
https://github.com/vllm-project/vllm/blob/main/vllm/lora/models.py#L219
Just set this line to `pin_memory = False`.
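Sketched with a hypothetical helper (again, not the actual vLLM code), disabling pinning amounts to hard-coding the flag that controls whether the CPU staging copy of the weights is page-locked:

```python
import torch

def stage_lora_on_cpu(tensor: torch.Tensor, pin_memory: bool = False) -> torch.Tensor:
    # pin_memory = False  # <- the hard-coded value suggested above
    cpu_copy = tensor.to("cpu")
    # Pinned (page-locked) memory speeds up later host-to-device copies,
    # but allocating it has its own cost.
    return cpu_copy.pin_memory() if pin_memory else cpu_copy
```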
Thanks for your suggestions @sampritipanda! I tried them, but none of them helped in my case.
What did help, though, was adding `device='cuda:0'` in this line. I understand this could cause slight memory overhead on the first process, but I'm not sure it actually does, since `self.device` is just `cuda` and not `cuda:PROCESS_IDX`, meaning that later in the code the LoRA still moves to device 0 and then gets sharded across all processes.
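To illustrate the device semantics this relies on (plain PyTorch, nothing vLLM-specific): a bare `cuda` device resolves to the calling process's current GPU, while `cuda:0` always means GPU 0.

```python
import torch

# "cuda" resolves to torch.cuda.current_device() for this process, so it
# can differ between tensor-parallel ranks; "cuda:0" is always GPU 0.
t_current = torch.zeros(4, device="cuda")
t_rank0 = torch.zeros(4, device="cuda:0")
print(t_current.device, t_rank0.device)
```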
I also set the `OMP_NUM_THREADS` env variable to 15 in all processes (just an arbitrary number); it also slightly improved model and LoRA loading speed.
Hello. I have been investigating this behavior as well. I believe the root cause of the slowdown is CPU contention and throttling, particularly in an environment like Kubernetes with containers that have CPU limits.
Setting `OMP_NUM_THREADS` to a low number (like 1 or 2) was the solution I found; it reduced the load time for the LoRA adapter I was testing with from 2 minutes to 3.5 seconds with TP=4.
My test environment was Kubernetes, on a node with 80 total cores, with CPU requests set to 8 and CPU limits set to 16. Without `OMP_NUM_THREADS`, I think each shard would spawn 80 threads and they'd fight for the CPU. Watching the load with `top`, I noticed CPU time exceeding 16, which would result in throttling on top of context-switching costs.
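If you launch vLLM programmatically, a minimal way to apply this (assuming worker processes inherit the parent's environment; the model name is just an example) is to set the variable before torch/vLLM are imported, since the OpenMP runtime reads it once at startup:

```python
import os

# Must be set before torch (and therefore vLLM) is imported: the OpenMP
# runtime reads OMP_NUM_THREADS once at initialization.
os.environ["OMP_NUM_THREADS"] = "2"

from vllm import LLM  # imported only after the env var is set

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # example model
    enable_lora=True,
    tensor_parallel_size=4,
)
```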
Your current environment
🐛 Describe the bug
Built the OpenAI Docker image from https://github.com/vllm-project/vllm/releases/tag/v0.5.0.post1.
Started vLLM with the llama-2-70b model, LoRA support, and tensor-parallel=4. The first LoRA request takes more than a minute. The problem is in this function: it is very fast on the first process and very slow (>40 seconds) on all the other processes. Here is the log output:
As you can see, all 4 processes start loading the LoRA at 14:44:43, but only the first one finishes at 14:44:46; the other 3 finish only at 14:45:23. What is the problem?
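For reference, a minimal script along these lines reproduces the measurement (adapter name and path are placeholders):

```python
import time

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    enable_lora=True,
    tensor_parallel_size=4,
)

# The first request with a LoRA triggers loading the adapter on every
# tensor-parallel worker, which is where the slowdown shows up.
start = time.perf_counter()
llm.generate(
    "Hello",
    SamplingParams(max_tokens=8),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora"),  # placeholder
)
print(f"first LoRA request took {time.perf_counter() - start:.1f}s")
```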