vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Gemma2-9b not working on A10G 24gb gpu #6242

Open Abhinay2323 opened 1 month ago

Abhinay2323 commented 1 month ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.31

Python version: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 535.104.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7R32
Stepping: 0
CPU MHz: 3291.417
BogoMIPS: 5600.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64 KiB
L1i cache: 64 KiB
L2 cache: 1 MiB
L3 cache: 8 MiB
NUMA node0 CPU(s): 0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pytorch-lightning==2.3.0
[pip3] torch==2.3.0
[pip3] torchmetrics==1.4.0.post0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0   X      0-3             N/A              N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

How would you like to use vllm

Hi @robertgshaw2-neuralmagic

Command used:

VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server --model google/gemma-2-9b-it --trust-remote-code --download-dir /opt/dlami/nvme --max-model-len 512 --gpu-memory-utilization 0.95

ERROR:

[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 366, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 87, in initialize_cache
[rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 214, in initialize_cache
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 230, in _warm_up_model
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1109, in capture_model
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1327, in capture
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 336, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 277, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 229, in forward
[rank0]:     hidden_states, residual = self.pre_feedforward_layernorm(
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/model_executor/custom_op.py", line 13, in forward
[rank0]:     return self._forward_method(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/model_executor/layers/layernorm.py", line 143, in forward_cuda
[rank0]:     return self.forward_native(x, residual)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/model_executor/layers/layernorm.py", line 129, in forward_native
[rank0]:     variance = x.pow(2).mean(dim=-1, keepdim=True)
[rank0]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[rank0]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
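One debugging step the error message itself suggests: re-running with CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches, so the traceback should point at the kernel that actually faults rather than a later API call. This is just that suggestion applied to my command above (nothing else changed):

CUDA_LAUNCH_BLOCKING=1 VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server --model google/gemma-2-9b-it --trust-remote-code --download-dir /opt/dlami/nvme --max-model-len 512 --gpu-memory-utilization 0.95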

When I set --gpu-memory-utilization to 0.9, this is the error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 366, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/executor/gpu_executor.py", line 87, in initialize_cache
[rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 206, in initialize_cache
[rank0]:   File "/mnt/datadrive2/finetuning_venv/lib/python3.11/site-packages/vllm/worker/worker.py", line 348, in raise_if_cache_size_invalid
[rank0]:     raise ValueError("No available memory for the cache blocks. "
[rank0]: ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.

I am unable to run Gemma2-9b (~18GB of weights) on an A10G (24GB) GPU. May I know the reason, and can I solve this issue by changing any settings?
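Rough math on why the cache-block error appears, assuming vLLM budgets gpu_memory_utilization * total VRAM and carves the KV cache out of whatever the weights and activation workspace leave behind (all numbers below are approximate assumptions, not measured):

    # Back-of-the-envelope KV-cache budget; illustrative numbers only.
    total_vram = 24.0                  # A10G, GB
    budget = 0.90 * total_vram         # ~21.6 GB that vLLM is allowed to use
    weights = 18.0                     # gemma-2-9b-it bf16 weights, approx.
    leftover = budget - weights        # ~3.6 GB for activations + KV cache
    print(f"{leftover:.1f} GB left before activation/CUDA-graph overhead")

With that little headroom, it seems plausible there is no room left for even the minimum number of cache blocks, which would match the ValueError above.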

KwanWaiChung commented 1 month ago

Having the same RuntimeError: CUDA error: an illegal memory access was encountered on an A100 80GB. Not sure how everyone else is able to run gemma2-9b.

ssmi153 commented 1 month ago

VRAM capacity seems to be the issue with Gemma2 on a 24GB GPU. The new FP8 automatic quantization format seems to resolve this (and works on Ampere GPUs). This configuration works for me on an A5000 (which is effectively an A10G):

--model google/gemma-2-9b-it --tensor-parallel-size 1 --max-model-len 4096 --quantization fp8

Also, as you've done previously, you need to set the VLLM_ATTENTION_BACKEND=FLASHINFER environment variable.
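Putting both pieces together, the full launch command would look something like this (untested beyond my setup; adjust the flags to taste):

VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server --model google/gemma-2-9b-it --tensor-parallel-size 1 --max-model-len 4096 --quantization fp8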