vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Gemma-2-2b-it load model hangs by vLLM==0.5.1 on Tesla T4 GPU #7464

Closed wlwqq closed 3 months ago

wlwqq commented 3 months ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31
Python version: 3.8.19 | packaged by conda-forge | (default, Mar 20 2024, 12:47:35) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-4.19.95-35-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 470.161.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU: /bin/sh: lscpu: not found

Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu121torch2.3
[pip3] numpy==1.24.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.44.0
[pip3] triton==2.3.0
[conda] flashinfer          0.0.8+cu121torch2.3    pypi_0    pypi
[conda] numpy               1.24.4                 pypi_0    pypi
[conda] nvidia-nccl-cu12    2.20.5                 pypi_0    pypi
[conda] torch               2.3.0                  pypi_0    pypi
[conda] torchvision         0.18.0                 pypi_0    pypi
[conda] transformers        4.44.0                 pypi_0    pypi
[conda] triton              2.3.0                  pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity
GPU0     X      24-47,72-95     1

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

```python
import os

from vllm import LLM, SamplingParams

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_DO_NOT_TRACK"] = "1"

llm = LLM(
    model="/data/test/gemma2_2b_it_prod",
    max_model_len=2048,
    trust_remote_code=False,
    block_size=4,
    max_num_seqs=2,
    swap_space=16,
    max_seq_len_to_capture=512,
    load_format="auto",
    dtype="float16",
    kv_cache_dtype="auto",
    seed=0,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
    tensor_parallel_size=1,
    worker_use_ray=False,
)
```

When I run the code above, loading the model hangs:

```text
WARNING 08-13 07:04:00 config.py:1354] Casting torch.bfloat16 to torch.float16.
WARNING 08-13 07:04:00 utils.py:562] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 08-13 07:04:00 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/mnt/posfs/globalmount/gemma-2-2b-it', speculative_config=None, tokenizer='/mnt/posfs/globalmount/gemma-2-2b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/mnt/posfs/globalmount/gemma-2-2b-it, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-13 07:04:01 selector.py:79] Using Flashinfer backend.
WARNING 08-13 07:04:01 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
INFO 08-13 07:04:01 selector.py:79] Using Flashinfer backend.
WARNING 08-13 07:04:01 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
```
wlwqq commented 3 months ago

It runs successfully on an A10, but it hangs on a T4.

youkaichao commented 3 months ago

might be a flashinfer problem, cc @LiuXiaoxuanPKU

wlwqq commented 3 months ago

My FlashInfer version is `flashinfer-0.0.8+cu121torch2.3-cp38-cp38-linux_x86_64.whl`.

@youkaichao @LiuXiaoxuanPKU

robertgshaw2-neuralmagic commented 3 months ago

FlashInfer needs compute capability >= 8.0 (see the docs: https://docs.flashinfer.ai/installation.html), so it will not work on a T4 (compute capability 7.5). We should have a better error message than this, though.

If you install the latest nightly, you can run Gemma with logits soft capping via the FlashAttention backend.
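
For reference, a quick way to check whether a GPU meets that requirement before opting into FlashInfer (a minimal sketch using PyTorch; vLLM's selector does a similar check internally, but this is not vLLM code):

```python
import os

import torch

# FlashInfer requires compute capability >= 8.0 (Ampere or newer);
# a Tesla T4 reports (7, 5) and would take the fallback branch.
major, minor = torch.cuda.get_device_capability()
if major >= 8:
    os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
else:
    print(f"Compute capability {major}.{minor} < 8.0; leaving the default backend.")
```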

wlwqq commented 3 months ago

So I should change `os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"` to `os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"`? @robertgshaw2-neuralmagic

wlwqq commented 3 months ago

In vLLM 0.5.1, selector.py has:

```python
if selected_backend == _Backend.FLASH_ATTN:
    if torch.cuda.get_device_capability()[0] < 8:
        # Volta and Turing NVIDIA GPUs.
        logger.info(
            "Cannot use FlashAttention-2 backend for Volta and Turing "
            "GPUs.")
        selected_backend = _Backend.XFORMERS
```

If I want to run gemma-2-2b on a Tesla T4, which attention backend can I use? @youkaichao @robertgshaw2-neuralmagic

youkaichao commented 3 months ago

Then you are out of luck. XFORMERS does not support the logits soft capping used in Gemma 2 :(
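
For context, the soft capping Gemma 2 applies to its attention and final logits is a tanh-based squashing, which is why the attention kernel itself has to implement it. A rough sketch of the formula (not vLLM code; 50.0 and 30.0 are the usual Gemma 2 cap values):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Soft capping squashes logits into the open interval (-cap, cap).
    return cap * torch.tanh(logits / cap)

logits = torch.tensor([10.0, 100.0, -200.0])
print(soft_cap(logits, 50.0))  # attention logits, cap 50.0
print(soft_cap(logits, 30.0))  # final (lm-head) logits, cap 30.0
```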

youkaichao commented 3 months ago

Use a more recent GPU, that's all I can say. Or pray for FlashAttention to support the T4.

fullstackwebdev commented 2 months ago

just got this on a 3090

sld commented 2 months ago

I dealt with it by setting `"attn_logit_softcapping": null` and `"final_logit_softcapping": null` in config.json. It successfully runs on 2x Nvidia T4 with vllm/vllm-openai:v0.6.2. My benchmarks did not get worse.

```
ubuntu@t4-x2:~/models/google_gemma-2-9b-it$ cat config.json
{
  "architectures": [
    "Gemma2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "attn_logit_softcapping": null,
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "eos_token_id": 1,
  "final_logit_softcapping": null,
  "head_dim": 256,
  "hidden_act": "gelu_pytorch_tanh",
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "gemma2",
  "num_attention_heads": 16,
  "num_hidden_layers": 42,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "query_pre_attn_scalar": 256,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "sliding_window_size": 4096,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.42.0.dev0",
  "use_cache": true,
  "vocab_size": 256000
}
```
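
For anyone wanting to script the same workaround, a minimal sketch (the model directory below is a placeholder; point it at your own local copy of the model):

```python
import json
from pathlib import Path

# Placeholder path: adjust to your local copy of the model.
config_path = Path("/path/to/google_gemma-2-9b-it/config.json")

config = json.loads(config_path.read_text())
# Null out both soft-capping fields so backends without soft-capping support can run.
config["attn_logit_softcapping"] = None
config["final_logit_softcapping"] = None
config_path.write_text(json.dumps(config, indent=2) + "\n")
```
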
gerencher commented 1 month ago

@sld Thanks for your solution. I am trying to reproduce it. Where does the config.json get passed? Thanks