I can run this successfully on an A10, but it hangs on a T4.
This might be a FlashInfer problem, cc @LiuXiaoxuanPKU
My FlashInfer version: flashinfer-0.0.8+cu121torch2.3-cp38-cp38-linux_x86_64.whl
@youkaichao @LiuXiaoxuanPKU
FlashInfer requires compute capability ≥ 8.0 (see the docs: https://docs.flashinfer.ai/installation.html), so it will not work on a T4 (compute capability 7.5). We should have a better error message than this, though.
If you install the latest nightly, you can run Gemma with logit soft-capping via the FlashAttention backend.
So should I change `os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"` to `os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"`? @robertgshaw2-neuralmagic
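For reference, a minimal sketch of how that variable is usually set, assuming it happens before vLLM is imported; the model id, prompt, and sampling parameters are placeholders, not from this thread:

```python
import os

# Select the attention backend before importing vLLM, which reads this variable.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

# Placeholder model and prompt for illustration; per the discussion below,
# FLASH_ATTN still requires an Ampere-or-newer GPU, so this won't help on a T4.
llm = LLM(model="google/gemma-2-2b-it")
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```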
In vLLM 0.5.1, `selector.py` has:

```python
if selected_backend == _Backend.FLASH_ATTN:
    if torch.cuda.get_device_capability()[0] < 8:
        # Volta and Turing NVIDIA GPUs.
        logger.info(
            "Cannot use FlashAttention-2 backend for Volta and Turing "
            "GPUs.")
        selected_backend = _Backend.XFORMERS
```
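For context, a quick check (assuming PyTorch with CUDA available) of why that branch fires on a T4:

```python
import torch

# Turing-generation GPUs such as the T4 report compute capability 7.5,
# so the `< 8` check above falls back to the XFORMERS backend.
major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
```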
If I want to run gemma-2-2b on a Tesla T4, which attention backend can I use? @youkaichao @robertgshaw2-neuralmagic
Then you are out of luck. XFORMERS does not support the logit soft-capping used in Gemma :(
Use a more recent GPU, that's all I can say, or pray for FlashAttention to support the T4.
just got this on a 3090
I dealt with it by setting `"attn_logit_softcapping": null` and `"final_logit_softcapping": null` in config.json. It then ran successfully on 2x NVIDIA T4 with vllm/vllm-openai:v0.6.2, and my benchmarks did not get worse.
```text
ubuntu@t4-x2:~/models/google_gemma-2-9b-it$ cat config.json
{
  "architectures": [
    "Gemma2ForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "attn_logit_softcapping": null,
  "bos_token_id": 2,
  "cache_implementation": "hybrid",
  "eos_token_id": 1,
  "final_logit_softcapping": null,
  "head_dim": 256,
  "hidden_act": "gelu_pytorch_tanh",
  "hidden_activation": "gelu_pytorch_tanh",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "gemma2",
  "num_attention_heads": 16,
  "num_hidden_layers": 42,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "query_pre_attn_scalar": 256,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "sliding_window_size": 4096,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.42.0.dev0",
  "use_cache": true,
  "vocab_size": 256000
}
```
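A minimal sketch of applying that edit to a locally downloaded copy of the model; the path below is illustrative, not from this thread, and assumes the model files (including config.json) are already on disk in a directory that vLLM is then pointed at:

```python
import json
from pathlib import Path

# Illustrative path; adjust to wherever the model checkpoint lives locally.
config_path = Path.home() / "models" / "google_gemma-2-9b-it" / "config.json"
config = json.loads(config_path.read_text())

# Disable logit soft-capping so backends without soft-capping support can serve the model.
config["attn_logit_softcapping"] = None
config["final_logit_softcapping"] = None

config_path.write_text(json.dumps(config, indent=2) + "\n")
```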
@sld Thanks for your solution. I am trying to reproduce it. Where does the modified config.json get passed in? Thanks
Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31

Python version: 3.8.19 | packaged by conda-forge | (default, Mar 20 2024, 12:47:35) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-4.19.95-35-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 470.161.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
/bin/sh: lscpu: not found

Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu121torch2.3
[pip3] numpy==1.24.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.44.0
[pip3] triton==2.3.0
[conda] flashinfer          0.0.8+cu121torch2.3    pypi_0    pypi
[conda] numpy               1.24.4                 pypi_0    pypi
[conda] nvidia-nccl-cu12    2.20.5                 pypi_0    pypi
[conda] torch               2.3.0                  pypi_0    pypi
[conda] torchvision         0.18.0                 pypi_0    pypi
[conda] transformers        4.44.0                 pypi_0    pypi
[conda] triton              2.3.0                  pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity
GPU0     X      24-47,72-95     1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug
When I run the code above, model loading hangs.