vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: The driver_worker gets stuck 100% of the time when using Medusa with TP > 1 #9573

Open Abatom opened 2 weeks ago

Abatom commented 2 weeks ago

Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.4.0+cu121
OS: Ubuntu 22.04.3 LTS (x86_64)
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
CPU:
Architecture: x86_64
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-dali-cuda120==1.33.0
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] nvidia-pyindex==1.0.9
[pip3] pynvml==11.4.1
[pip3] pyzmq==25.1.2
[pip3] torch==2.4.0
[pip3] transformers==4.45.2
vLLM Version: 0.6.3.post1
```

Model Input Dumps

None

🐛 Describe the bug

Run the server with the following command:

```bash
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_ATTENTION_BACKEND="FLASH_ATTN"

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 30000 \
  --served-model-name base_model --tokenizer-mode auto --max-model-len 2048 \
  --max-num-batched-tokens 20480 --max-num-seqs 8 \
  --tensor-parallel-size 2 --trust-remote-code \
  --gpu-memory-utilization 0.8 --disable-custom-all-reduce --dtype float16 \
  --speculative-model /home/work/qwen2/medusa \
  --model /home/work/qwen2 \
  --use-v2-block-manager --num-speculative-tokens 2 \
  --enable-prefix-caching
```
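
For reference, the same engine configuration can also be driven without the API server through the offline `LLM` entrypoint (a minimal sketch; the prompt and sampling settings are placeholders, and the keyword arguments simply mirror the CLI flags above):

```python
# Hypothetical offline reproduction mirroring the CLI flags above
# (prompt and sampling settings are placeholders).
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(
    model="/home/work/qwen2",
    speculative_model="/home/work/qwen2/medusa",
    num_speculative_tokens=2,
    tensor_parallel_size=2,
    max_model_len=2048,
    max_num_seqs=8,
    gpu_memory_utilization=0.8,
    dtype="float16",
    disable_custom_all_reduce=True,
    enable_prefix_caching=True,
    use_v2_block_manager=True,
    trust_remote_code=True,
)

# Generation stalls here when the TP=2 Medusa run hangs in the gather.
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```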

The driver_worker gets stuck in `_get_logits` (marked in the snippet below) because the second worker (GPU 1) never computes the logits from `lm_head.linear_method.apply()` and therefore never joins the gather. Meanwhile, the second worker process reports the error "No available block found in 60 seconds".

```python
# vllm/model_executor/layers/logits_processor.py
def _get_logits(
    self,
    hidden_states: torch.Tensor,
    lm_head: VocabParallelEmbedding,
    embedding_bias: Optional[torch.Tensor],
) -> Optional[torch.Tensor]:
    logits = lm_head.linear_method.apply(lm_head,
                                         hidden_states,
                                         bias=embedding_bias)
    if self.use_gather:
        # HERE: the driver_worker blocks in this gather because the second
        # worker (GPU 1) never computed its logits above and so never
        # reaches the matching collective call.
        logits = tensor_model_parallel_gather(logits)
    else:
        logits = tensor_model_parallel_all_gather(logits)
    if logits is not None:
        logits = logits[..., :self.org_vocab_size]
    return logits
```
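
My reading (not a confirmed root cause) is a mismatched collective: the driver rank enters `tensor_model_parallel_gather`, but the second worker never reaches the corresponding call, so the gather blocks forever. A minimal sketch of that failure shape in plain `torch.distributed` on the gloo backend (not vLLM code; the script deliberately hangs at the gather on rank 0):

```python
# Minimal illustration (plain torch.distributed, not vLLM code) of the failure
# shape: a rank that enters a collective blocks until every rank joins it.
# Running this script deliberately hangs at the dist.gather call on rank 0.
import os
import time

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:
        # Driver worker: computes its logits shard and waits for the others.
        logits = torch.randn(4)
        gather_list = [torch.empty(4) for _ in range(world_size)]
        dist.gather(logits, gather_list, dst=0)  # blocks: rank 1 never joins
        print("gather finished")  # never reached
    else:
        # Second worker: simulates failing before the logits are computed
        # (e.g. "No available block found in 60 seconds"), so it never
        # reaches the matching dist.gather call.
        time.sleep(3600)


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```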

Additionally, I found that everything works fine when using [ngram] speculative decoding:

```bash
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_ATTENTION_BACKEND="FLASH_ATTN"

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 30000 \
  --served-model-name base_model --tokenizer-mode auto --max-model-len 2048 \
  --max-num-batched-tokens 20480 --max-num-seqs 8 \
  --tensor-parallel-size 2 --trust-remote-code \
  --gpu-memory-utilization 0.8 --disable-custom-all-reduce --dtype float16 \
  --speculative-model="[ngram]" --ngram_prompt_lookup_max=4 \
  --model /home/work/qwen2 \
  --use-v2-block-manager --num-speculative-tokens 2 \
  --enable-prefix-caching
```


Abatom commented 2 weeks ago

@abhigoyal1997, please take a look!

junzhang-zj commented 2 days ago

Is there any progress on the TP > 1 implementation of draft-based speculative decoding?