vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Llama-3.2-11B-Vision OOM error when run with vllm==0.6.3 (L20 GPU) #10569

Closed: Jamrainbow closed this issue 1 hour ago

Jamrainbow commented 3 days ago

Your current environment

The output of `python collect_env.py`:

```text
WARNING 11-22 07:19:14 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
Warning: Your installation of OpenCV appears to be broken: module 'cv2.dnn' has no attribute 'DictValue'. Please follow the instructions at https://github.com/opencv/opencv-python/issues/884 to correct your environment. The import of cv2 has been skipped.
WARNING 11-22 07:19:19 config.py:395] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 11-22 07:19:19 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/home/dataset/Llama-3.2-11B-Vision-Instruct', speculative_config=None, tokenizer='/home/dataset/Llama-3.2-11B-Vision-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/dataset/Llama-3.2-11B-Vision-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 11-22 07:19:19 enc_dec_model_runner.py:141] EncoderDecoderModelRunner requires XFormers backend; overriding backend auto-selection and forcing XFormers.
INFO 11-22 07:19:19 selector.py:115] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 11-22 07:19:20 model_runner.py:1056] Starting to load model /home/dataset/Llama-3.2-11B-Vision-Instruct...
INFO 11-22 07:19:20 selector.py:115] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:     llm = LLM(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 177, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 573, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 348, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 483, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
[rank0]:     return self.driver_worker.determine_num_available_blocks()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/enc_dec_model_runner.py", line 359, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/enc_dec_model_runner.py", line 203, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mllama.py", line 1256, in forward
[rank0]:     cross_attention_states = self.get_cross_attention_states(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mllama.py", line 1144, in get_cross_attention_states
[rank0]:     cross_attention_states = self.vision_model(pixel_values,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mllama.py", line 552, in forward
[rank0]:     hidden_state = self.gated_positional_embedding(hidden_state,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mllama.py", line 318, in forward
[rank0]:     gated_tile_position_embedding = self.gate.tanh(
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 0 has a total capacity of 44.52 GiB of which 1.24 GiB is free. Including non-PyTorch memory, this process has 43.28 GiB memory in use. Of the allocated memory 42.77 GiB is allocated by PyTorch, and 175.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

The GPU is an NVIDIA L20:

![image](https://github.com/user-attachments/assets/ec2a80d4-8eb6-4bf9-b80a-3feaf276569e)

The pip environment is:

```text
Package                           Version                          Editable project location
--------------------------------- -------------------------------- -------------------------
absl-py 2.1.0
accelerate 1.1.1
aiofiles 23.2.1
aiohttp 3.9.1
aiosignal 1.3.1
altair 5.3.0
annotated-types 0.6.0
anyio 4.3.0
apex 0.1
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
asttokens 2.4.1
astunparse 1.6.3
async-timeout 4.0.3
attrs 23.2.0
audioread 3.0.1
av 13.1.0
bandit 1.7.7
beautifulsoup4 4.12.3
bleach 6.1.0
blis 0.7.11
build 1.1.1
bypy 1.8.4
cachetools 5.3.2
catalogue 2.0.10
certifi 2023.11.17
cffi 1.16.0
cfgv 3.4.0
chardet 5.2.0
charset-normalizer 3.3.2
click 8.1.7
cloudpathlib 0.16.0
cloudpickle 3.0.0
cmake 3.28.1
colorama 0.4.6
colored 2.2.4
coloredlogs 15.0.1
comm 0.2.1
compressed-tensors 0.6.0
confection 0.1.4
contourpy 1.2.0
coverage 7.4.4
cubinlinker 0.3.0+2.g405ac64
cuda-python 12.2.0
cudf 23.12.0
cugraph 23.12.0
cugraph-dgl 23.12.0
cugraph-service-client 23.12.0
cugraph-service-server 23.12.0
cuml 23.12.0
cupy-cuda12x 12.3.0
cycler 0.12.1
cymem 2.0.8
Cython 3.0.8
dask 2023.11.0
dask-cuda 23.12.0
dask-cudf 23.12.0
DataProperty 1.0.1
datasets 2.18.0
debugpy 1.8.0
decorator 5.1.1
defusedxml 0.7.1
diffusers 0.15.0
dill 0.3.8
diskcache 5.6.3
distlib 0.3.8
distributed 2023.11.0
distro 1.7.0
dm-tree 0.1.8
einops 0.7.0
evaluate 0.4.1
exceptiongroup 1.2.0
execnet 2.0.2
executing 2.0.1
expecttest 0.1.3
fastapi 0.110.0
fastjsonschema 2.19.1
fastrlock 0.8.2
ffmpy 0.3.2
filelock 3.13.1
flash-attn 2.5.9.post1
flatbuffers 24.3.25
fonttools 4.47.2
frozenlist 1.4.1
fschat 0.2.36
fsspec 2023.12.2
gast 0.5.4
gguf 0.10.0
google-auth 2.26.2
google-auth-oauthlib 0.4.6
gradio 4.26.0
gradio_client 0.15.1
graphsurgeon 0.4.6
graphviz 0.20.3
grpcio 1.60.0
h11 0.14.0
h5py 3.10.0
hf_transfer 0.1.6
httpcore 1.0.5
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.23.2
humanfriendly 10.0
hypothesis 5.35.1
identify 2.5.35
idna 3.6
importlib-metadata 7.0.1
importlib_resources 6.4.0
iniconfig 2.0.0
intel-openmp 2021.4.0
interegular 0.3.3
ipykernel 6.29.0
ipython 8.20.0
ipython-genutils 0.2.0
janus 1.0.0
jedi 0.19.1
Jinja2 3.1.3
jiter 0.7.1
joblib 1.3.2
json5 0.9.14
jsonlines 4.0.0
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
jupyter_client 8.6.0
jupyter_core 5.7.1
jupyter-tensorboard 0.2.0
jupyterlab 2.3.2
jupyterlab_pygments 0.3.0
jupyterlab-server 1.2.0
jupytext 1.16.1
kiwisolver 1.4.5
langcodes 3.3.0
lark 1.1.9
lazy_loader 0.3
librosa 0.10.1
llvmlite 0.43.0
lm_eval 0.4.2
lm-format-enforcer 0.10.6
locket 1.0.0
lxml 5.1.0
Markdown 3.5.2
markdown-it-py 3.0.0
markdown2 2.4.13
MarkupSafe 2.1.4
matplotlib 3.8.2
matplotlib-inline 0.1.6
mbstrdecoder 1.1.3
mdit-py-plugins 0.4.0
mdurl 0.1.2
mistral_common 1.4.4
mistune 3.0.2
mkl 2021.1.1
mkl-devel 2021.1.1
mkl-include 2021.1.1
mock 5.1.0
more-itertools 10.2.0
mpi4py 3.1.5
mpmath 1.3.0
msgpack 1.0.7
msgspec 0.18.6
multidict 6.0.4
multiprocess 0.70.16
murmurhash 1.0.10
mypy 1.9.0
mypy-extensions 1.0.0
nbclient 0.9.0
nbconvert 7.14.2
nbformat 5.9.2
nest-asyncio 1.5.9
networkx 2.6.3
nh3 0.2.17
ninja 1.11.1.1
nltk 3.8.1
nodeenv 1.8.0
notebook 6.4.10
numba 0.60.0
numexpr 2.9.0
numpy 1.26.4
nvfuser 0.1.1+gitunknown
nvidia-ammo 0.7.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-dali-cuda120 1.33.0
nvidia-ml-py 12.550.52
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.1.105
nvidia-pyindex 1.0.9
nvtx 0.2.5
oauthlib 3.2.2
onnx 1.15.0rc2
onnx-graphsurgeon 0.3.27
onnxruntime 1.16.3
openai 1.54.4
opencv 4.7.0
opencv-python-headless 4.10.0.84
optimum 1.18.0
optree 0.10.0
orjson 3.10.3
outlines 0.0.46
packaging 23.2
pandas 1.5.3
pandocfilters 1.5.1
parameterized 0.9.0
parso 0.8.3
partd 1.4.1
partial-json-parser 0.2.1.1.post4
pathvalidate 3.2.0
pbr 6.0.0
peft 0.10.0
pexpect 4.9.0
pillow 10.4.0
pip 24.0
platformdirs 4.1.0
pluggy 1.3.0
ply 3.11
polygraphy 0.49.1
pooch 1.8.0
portalocker 2.8.2
pre-commit 3.7.0
preshed 3.0.9
prettytable 3.9.0
prometheus-client 0.19.0
prometheus-fastapi-instrumentator 7.0.0
prompt-toolkit 3.0.43
protobuf 4.24.4
psutil 5.9.4
ptxcompiler 0.8.1+2.g0d406d6
ptyprocess 0.7.0
PuLP 2.8.0
pure-eval 0.2.2
py 1.11.0
py-cpuinfo 9.0.0
pyairports 2.1.1
pyarrow 14.0.1.dev0+gba5374836.d20240125
pyarrow-hotfix 0.6
pyasn1 0.5.1
pyasn1-modules 0.3.0
pybind11 2.11.1
pybind11-global 2.11.1
pybind11-stubgen 2.5
pycocotools 2.0+nv0.8.0
pycountry 24.6.1
pycparser 2.21
pydantic 2.9.2
pydantic_core 2.23.4
pydantic-settings 2.3.0
pydub 0.25.1
Pygments 2.17.2
pylibcugraph 23.12.0
pylibcugraphops 23.12.0
pylibraft 23.12.0
pynvml 11.5.0
pyparsing 3.1.1
pyproject 1.3.1
pyproject_hooks 1.0.0
pytablewriter 1.2.0
pytest 7.4.4
pytest-cov 5.0.0
pytest-flakefinder 1.1.0
pytest-forked 1.6.0
pytest-rerunfailures 13.0
pytest-shard 0.1.2
pytest-xdist 3.5.0
python-dateutil 2.8.2
python-dotenv 1.0.1
python-hostlist 1.23.0
python-multipart 0.0.9
pytorch-quantization 2.1.2
pytz 2023.3.post1
PyYAML 6.0.1
pyzmq 25.1.2
qwen-vl-utils 0.0.8
raft-dask 23.12.0
rapids-dask-dependency 23.12.1
ray 2.10.0
referencing 0.32.1
regex 2023.12.25
requests 2.31.0
requests-oauthlib 1.3.1
requests-toolbelt 1.0.0
responses 0.18.0
rich 13.7.0
rmm 23.12.0
rouge-score 0.1.2
rpds-py 0.17.1
rsa 4.9
ruff 0.4.7
sacrebleu 2.4.1
safetensors 0.4.5
scikit-learn 1.2.0
scipy 1.12.0
semantic-version 2.10.0
Send2Trash 1.8.2
sentencepiece 0.2.0
setuptools 68.2.2
shellingham 1.5.4
shortuuid 1.0.13
six 1.16.0
smart-open 6.4.0
sniffio 1.3.1
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 2.5
soxr 0.3.7
spacy 3.7.2
spacy-legacy 3.0.12
spacy-loggers 1.0.5
sphinx-glpi-theme 0.5
sqlitedict 2.1.0
srsly 2.4.8
stack-data 0.6.3
starlette 0.36.3
stevedore 5.2.0
StrEnum 0.4.15
svgwrite 1.4.3
sympy 1.12
tabledata 1.3.3
tabulate 0.9.0
tbb 2021.11.0
tblib 3.0.0
tcolorpy 0.1.4
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorrt 9.2.0.post12.dev5
tensorrt-bindings 9.2.0.post12.dev5
tensorrt-libs 9.2.0.post12.dev5
tensorrt-llm 0.8.0
terminado 0.18.0
thinc 8.2.2
threadpoolctl 3.2.0
thriftpy2 0.4.17
tiktoken 0.7.0
tinycss2 1.2.1
tokenizers 0.20.3
toml 0.10.2
tomli 2.0.1
tomlkit 0.12.0
toolz 0.12.1
torch 2.4.0
torch-tensorrt 2.2.0a0
torchdata 0.7.0a0
torchtext 0.17.0a0
torchvision 0.19.0
tornado 6.4
tqdm 4.66.1
tqdm-multiprocess 0.0.11
traitlets 5.9.0
transformers 4.46.2
treelite 3.9.1
treelite-runtime 3.9.1
triton 3.0.0
typepy 1.3.2
typer 0.9.0
types-dataclasses 0.6.6
typing_extensions 4.12.2
ucx-py 0.35.0
uff 0.6.9
urllib3 1.26.18
uvicorn 0.29.0
uvloop 0.19.0
virtualenv 20.25.1
vllm 0.6.3.post1
vllm-flash-attn 2.5.9.post1
wasabi 1.1.2
watchfiles 0.21.0
wavedrom 2.0.3.post3
wcwidth 0.2.13
weasel 0.3.4
webencodings 0.5.1
websockets 11.0.3
Werkzeug 3.0.1
wheel 0.42.0
word2number 1.1
xdoctest 1.0.2
xformers 0.0.27.post2
xgboost 1.7.6
xxhash 3.4.1
yarl 1.9.4
zict 3.0.0
zipp 3.17.0
zstandard 0.22.0
```
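
As a side note, the last line of the OOM message itself suggests `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. Below is a minimal sketch of setting it from Python, assuming it runs before torch first touches CUDA; this only mitigates allocator fragmentation, and with only ~176 MiB reserved-but-unallocated here it is unlikely to be the real fix:

```python
import os

# Must be set before torch initializes CUDA, hence before importing vllm.
# This reduces fragmentation; it cannot rescue an allocation that simply
# exceeds the memory that is actually free.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM  # imported only after the env var is set
```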

Model Input Dumps

No response

🐛 Describe the bug

```python
from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from vllm.assets.image import ImageAsset
from decord import VideoReader, cpu
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int, default=1)
parser.add_argument('--input_len', type=int, default=128)
parser.add_argument('--output_len', type=int, default=1024)
args = parser.parse_args()

model_name = "/home/dataset/Llama-3.2-11B-Vision-Instruct"

llm = LLM(
    model=model_name,
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    enforce_eager=True,
)
sampling_params = SamplingParams(temperature=1, max_tokens=args.output_len, ignore_eos=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

query = 'Please describe the image in detail.'
TEMPLATE = "<|im_start|>User\n{prompt}<|im_end|>\n<|im_start|>Assistant\n"

prompt = f"<|image|><|begin_of_text|>{query}\n"
prompt = TEMPLATE.format(prompt=prompt)

image = Image.open("3.png")

# One identical image request per batch element.
inputs = [{
    "prompt": prompt,
    "multi_modal_data": {
        "image": image,
    },
} for _ in range(args.batch_size)]

print("**** model generate begin ****")
import time
start = time.time()

outputs = llm.generate(inputs, sampling_params=sampling_params)
cost = time.time() - start
print("model total cost is ", cost)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)
```


Jamrainbow commented 3 days ago

I also tried setting tensor_parallel_size=2 to use two L20 GPUs, but it still fails with the same OOM error:

```python
llm = LLM(
    model=model_name,
    tensor_parallel_size=2,
    max_model_len=4096,
    trust_remote_code=True,
    enforce_eager=True,
)
```

DarkLight1337 commented 3 days ago

You can reduce max_num_seqs (e.g. to 2) to avoid OOM.
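
A minimal sketch of that suggestion, reusing the reporter's arguments (`2` is just the example value from this comment; raise it as far as memory allows):

```python
from vllm import LLM

llm = LLM(
    model="/home/dataset/Llama-3.2-11B-Vision-Instruct",
    max_model_len=4096,
    max_num_seqs=2,  # cap on concurrently scheduled sequences; also bounds the startup profile run
    trust_remote_code=True,
    enforce_eager=True,
)
```

During initialization, vLLM profiles peak memory by running a dummy batch of `max_num_seqs` requests, and for this encoder-decoder vision model each profiled request pushes image tensors through the vision encoder, which is exactly where the traceback above runs out of memory; lowering the cap shrinks that profiling batch.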

ccruttjr commented 4 hours ago

Yeah, not a bug; this has been asked about before.