The driver_worker gets stuck in `_get_logits` (shown below) because the second worker (GPU) never computes the logits returned by `lm_head.linear_method.apply()`. The symptom is that the second worker process reports the error "No available block found in 60 seconds".
    # vllm/model_executor/layers/logits_processor.py
    def _get_logits(
        self,
        hidden_states: torch.Tensor,
        lm_head: VocabParallelEmbedding,
        embedding_bias: Optional[torch.Tensor],
    ) -> Optional[torch.Tensor]:
        logits = lm_head.linear_method.apply(lm_head,
                                             hidden_states,
                                             bias=embedding_bias)
        if self.use_gather:
            logits = tensor_model_parallel_gather(logits)
            # HERE!!! The driver_worker is stuck here because the second worker (GPU)
            # did not compute the logits returned by lm_head.linear_method.apply().
        else:
            logits = tensor_model_parallel_all_gather(logits)
        if logits is not None:
            logits = logits[..., :self.org_vocab_size]
        return logits
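
The hang above is what one would expect from a collective operation that not every rank enters: the participating rank waits for a peer that never arrives. The following is a minimal, stdlib-only sketch of that behaviour. It is an analogy, not vLLM or NCCL code: `threading.Barrier` stands in for the gather, threads stand in for worker processes, and the names `simulate_gather`, `ranks_that_participate`, and `timeout_s` are hypothetical.

```python
import threading


def simulate_gather(world_size, ranks_that_participate, timeout_s=0.5):
    """Simulate a collective call with one thread per rank.

    Assumption: a Barrier is only a stand-in for a real gather. It captures
    one property, namely that the call completes only if all ranks enter it.
    """
    barrier = threading.Barrier(world_size)
    results = {}

    def worker(rank):
        if rank not in ranks_that_participate:
            # This rank never reaches the collective (like the second worker).
            results[rank] = "skipped"
            return
        try:
            # Blocks until every rank has called wait(), or the timeout hits.
            barrier.wait(timeout=timeout_s)
            results[rank] = "completed"
        except threading.BrokenBarrierError:
            # Raised when the wait times out because a peer never arrived.
            results[rank] = "timed out"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(simulate_gather(2, {0, 1}))  # both ranks enter the collective
    print(simulate_gather(2, {0}))     # rank 1 never enters it
```

In the sketch the waiting rank gives up after `timeout_s`; in the situation reported here the driver simply appears stuck at the gather.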
Additionally, I found that everything works fine when using n-gram.
Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
OS: Ubuntu 22.04.3 LTS (x86_64)
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
CPU:
Architecture: x86_64
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-dali-cuda120==1.33.0
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] nvidia-pyindex==1.0.9
[pip3] pynvml==11.4.1
[pip3] pyzmq==25.1.2
[pip3] torch==2.4.0
[pip3] transformers==4.45.2
vLLM Version: 0.6.3.post1
```
Model Input Dumps
None
🐛 Describe the bug
Run using the following command