vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: The driver_worker gets stuck 100% of the time, when using Medusa with TP > 1 #9573

Open Abatom opened 1 month ago

Abatom commented 1 month ago

Your current environment

The output of `python collect_env.py`

```text
PyTorch version: 2.4.0+cu121
OS: Ubuntu 22.04.3 LTS (x86_64)
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
GPU models and configuration:
GPU 0: NVIDIA A800-SXM4-80GB
GPU 1: NVIDIA A800-SXM4-80GB
CPU:
Architecture: x86_64
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-dali-cuda120==1.33.0
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] nvidia-pyindex==1.0.9
[pip3] pynvml==11.4.1
[pip3] pyzmq==25.1.2
[pip3] torch==2.4.0
[pip3] transformers==4.45.2
vLLM Version: 0.6.3.post1
```

Model Input Dumps

None

🐛 Describe the bug

Run with the following command:

```bash
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_ATTENTION_BACKEND="FLASH_ATTN"

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 30000 \
  --served-model-name base_model --tokenizer-mode auto --max-model-len 2048 \
  --max-num-batched-tokens 20480 --max-num-seqs 8 \
  --tensor-parallel-size 2 --trust-remote-code \
  --gpu-memory-utilization 0.8 --disable-custom-all-reduce --dtype float16 \
  --speculative-model /home/work/qwen2/medusa \
  --model /home/work/qwen2 \
  --use-v2-block-manager --num-speculative-tokens 2 \
  --enable-prefix-caching
```

The driver_worker gets stuck at the gather shown below because the second worker (the second GPU) never computes the logits returned by lm_head.linear_method.apply(). The observable symptom is that the second worker process reports the error "No available block found in 60 seconds".

```python
# vllm/model_executor/layers/logits_processor.py
def _get_logits(
    self,
    hidden_states: torch.Tensor,
    lm_head: VocabParallelEmbedding,
    embedding_bias: Optional[torch.Tensor],
) -> Optional[torch.Tensor]:
    logits = lm_head.linear_method.apply(lm_head,
                                         hidden_states,
                                         bias=embedding_bias)
    if self.use_gather:
        # HERE!!! The driver_worker is stuck in this gather because the second
        # worker (GPU) did not compute the logits returned by
        # lm_head.linear_method.apply().
        logits = tensor_model_parallel_gather(logits)
    else:
        logits = tensor_model_parallel_all_gather(logits)
    if logits is not None:
        logits = logits[..., :self.org_vocab_size]
    return logits
```
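
To make the hang mechanism concrete, here is a minimal sketch using plain torch.distributed rather than vLLM internals (the filename gather_hang_demo.py and the gloo backend are assumptions for the demo, not part of vLLM): the gathering rank blocks inside dist.gather until every other rank in the group enters the same collective, which mirrors the driver_worker waiting in tensor_model_parallel_gather while the second worker never produces its logits.

```python
# Minimal sketch, NOT vLLM code: illustrates the failure mode above, i.e. a
# blocking collective (gather) that the driver rank enters while the other
# rank never does. Run with (hypothetical filename):
#   torchrun --nproc_per_node=2 gather_hang_demo.py
import torch
import torch.distributed as dist


def main() -> None:
    # "gloo" so the demo runs on CPU-only machines; vLLM uses NCCL on GPUs.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    t = torch.full((4,), float(rank))

    if rank == 0:
        # The "driver" rank enters the gather and blocks until every other
        # rank in the group also calls dist.gather with dst=0.
        out = [torch.empty(4) for _ in range(world_size)]
        dist.gather(t, gather_list=out, dst=0)  # hangs forever in this demo
        print("rank 0 gathered:", out)
    else:
        # Simulate the broken worker: it never produces its tensor and so
        # never reaches the collective. Rank 0 therefore waits indefinitely,
        # analogous to the driver_worker stuck in tensor_model_parallel_gather
        # while the second worker eventually reports
        # "No available block found in 60 seconds".
        pass

    # Rank 0 never reaches this line because the gather never completes.
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```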

Additionally, I found that everything works fine when using [ngram] speculative decoding instead of the Medusa draft model:

```bash
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_ATTENTION_BACKEND="FLASH_ATTN"

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 30000 \
  --served-model-name base_model --tokenizer-mode auto --max-model-len 2048 \
  --max-num-batched-tokens 20480 --max-num-seqs 8 \
  --tensor-parallel-size 2 --trust-remote-code \
  --gpu-memory-utilization 0.8 --disable-custom-all-reduce --dtype float16 \
  --speculative-model="[ngram]" --ngram_prompt_lookup_max=4 \
  --model /home/work/qwen2 \
  --use-v2-block-manager --num-speculative-tokens 2 \
  --enable-prefix-caching
```


Abatom commented 1 month ago

@abhigoyal1997, take a look!

junzhang-zj commented 3 weeks ago

Is there any progress on the TP implementation of draft-based speculative decoding?