Open Samjith888 opened 3 days ago
By default, vLLM takes up 90% of GPU memory regardless of model size; the extra memory is used for the KV cache. If you don't want to use that much memory, you can set the --gpu-memory-utilization argument.
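To make the 90% default concrete, here is a rough sketch of the arithmetic. The helper name `memory_budget_gib` is hypothetical (it is not part of vLLM), and the 39.39 GiB capacity is the figure PyTorch reports for the A100 in the traceback further down this thread.

```python
def memory_budget_gib(total_gib: float, gpu_memory_utilization: float = 0.9) -> float:
    """Approximate GPU memory (GiB) that vLLM may claim for weights,
    activations and KV cache combined. Hypothetical helper for illustration."""
    return total_gib * gpu_memory_utilization


TOTAL_GIB = 39.39  # capacity reported by PyTorch in the OOM message in this thread

for fraction in (0.9, 0.5):
    budget = memory_budget_gib(TOTAL_GIB, fraction)
    print(f"gpu_memory_utilization={fraction}: about {budget:.2f} GiB budget")
```

Whatever part of that budget is not needed for weights and activations is handed to the KV cache, which is why usage looks similar across model sizes.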
There is no option to set --gpu-memory-utilization in this script
You can modify that script locally as you wish (just pass gpu_memory_utilization to the LLM constructor). That file is only for demonstration purposes and is not designed to accommodate every argument.
def run_qwen2_vl(question: str, modality: str):
    assert modality == "image"

    model_name = "Qwen/Qwen2-VL-7B-Instruct"

    llm = LLM(
        model=model_name,
        max_model_len=4096,
        max_num_seqs=5,
        # Note - mm_processor_kwargs can also be passed to generate/chat calls
        mm_processor_kwargs={
            "min_pixels": 28 * 28,
            "max_pixels": 1280 * 28 * 28,
        },
        gpu_memory_utilization=True
    )

    prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
              "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
              f"{question}<|im_end|>\n"
              "<|im_start|>assistant\n")
    stop_token_ids = None
    return llm, prompt, stop_token_ids
I added the gpu_memory_utilization flag to the script, but now I get this error:
WARNING 11-15 11:25:14 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 11:25:14 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 11:25:14 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
INFO 11-15 11:25:14 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241115-112514.pkl...
WARNING 11-15 11:25:15 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't pickle local object 'weak_bind.<locals>.weak_bound'
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1658, in execute_model
[rank0]: hidden_or_intermediate_states = model_executable(
[rank0]: File "/data/home/samjith/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/data/home/samjith/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1048, in forward
[rank0]: image_embeds = self._process_image_input(image_input)
[rank0]: File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_vl.py", line 978, in _process_image_input
[rank0]: pixel_values = image_input["data"].type(self.visual.dtype)
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 22.38 MiB is free. Including non-PyTorch memory, this process has 39.35 GiB memo
ry in use. Of the allocated memory 38.58 GiB is allocated by PyTorch, with 1.90 MiB allocated in private pools (e.g., CUDA Graphs), and 40.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocat
ed memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-vari
ables)
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/home/samjith/test_vllm/test.py", line 541, in <module>
[rank0]: main(args)
[rank0]: File "/data/home/samjith/test_vllm/test.py", line 506, in main
[rank0]: outputs = llm.generate(inputs, sampling_params=sampling_params)
[rank0]: File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/utils.py", line 1063, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 353, in generate
[rank0]: outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]: File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 879, in _run_engine
[rank0]: step_outputs = self.llm_engine.step()
[rank0]: File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1389, in step
[rank0]: outputs = self.model_executor.execute_model(
[rank0]: File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 134, in execute_model
[rank0]: output = self.driver_worker.execute_model(execute_model_req)
[rank0]: File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
[rank0]: output = self.model_runner.execute_model(
[rank0]: File "/data/home/samjith/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper
[rank0]: raise type(err)(f"Error in model execution: "
[rank0]: torch.OutOfMemoryError: Error in model execution: CUDA out of memory. Tried to allocate 44.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 22.38 MiB is free. Including non-PyTorch memory, this
process has 39.35 GiB memory in use. Of the allocated memory 38.58 GiB is allocated by PyTorch, with 1.90 MiB allocated in private pools (e.g., CUDA Graphs), and 40.25 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/
cuda.html#environment-variables)
Processed prompts: 0%| | 0/4 [00:01<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
gpu_memory_utilization should be a float between 0 and 1 (default 0.9). Since PyTorch will always use a bit of GPU memory, you should set this to a value less than 1.
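A small guard along these lines would catch the `True` mistake before the engine starts. This is only a sketch: `validate_gpu_memory_utilization` and `build_llm_kwargs` are hypothetical helpers, not vLLM functions, and the strict upper bound follows the advice above rather than any documented vLLM check. In Python, `bool` is a subclass of `int`, so `True` converts to `1.0`, which is presumably why the run above tried to claim nearly the whole GPU.

```python
def validate_gpu_memory_utilization(value) -> float:
    """Return value as a float strictly between 0 and 1, or raise.
    Hypothetical helper; the bounds follow the advice in this thread."""
    # bool is a subclass of int, so True would otherwise pass as 1.0.
    if isinstance(value, bool):
        raise TypeError("gpu_memory_utilization must be a float, not a bool")
    fraction = float(value)
    if not 0.0 < fraction < 1.0:
        raise ValueError("gpu_memory_utilization must be between 0 and 1 (exclusive)")
    return fraction


def build_llm_kwargs(model_name: str, gpu_memory_utilization: float = 0.9) -> dict:
    """Collect keyword arguments for the LLM constructor (hypothetical helper).
    max_model_len and max_num_seqs mirror the example script above."""
    return {
        "model": model_name,
        "max_model_len": 4096,
        "max_num_seqs": 5,
        "gpu_memory_utilization": validate_gpu_memory_utilization(gpu_memory_utilization),
    }


kwargs = build_llm_kwargs("Qwen/Qwen2-VL-7B-Instruct", gpu_memory_utilization=0.5)
print(kwargs)
# On a machine with vLLM installed, these would then be passed as LLM(**kwargs).
```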
gpu_memory_utilization=0.1
It still takes around 33 GB; earlier it was 38 GB.
I really wonder why vLLM consumes 1x more memory than the raw Hugging Face code.
Can you copy the startup log output from vLLM (may need to set log level to debug)? It should log how much memory is taken by the model vs KV cache.
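One way to raise the log level, assuming the `VLLM_LOGGING_LEVEL` environment variable is honored by your vLLM version, is to set it before vLLM is imported:

```python
import os

# Assumption: vLLM reads VLLM_LOGGING_LEVEL when it configures its logger,
# so this must run before `import vllm` anywhere in the process.
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"

print(os.environ["VLLM_LOGGING_LEVEL"])
```

Setting the same variable in the shell before launching the script should have the same effect.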
gpu_memory_utilization=0.5
INFO 11-15 12:03:17 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='Qwen/Qwen2-VL-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-VL-7B-Instruct', skip_tokenizer_ini
t=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_di
r=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None,
device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute
_time=False), seed=0, served_model_name=Qwen/Qwen2-VL-7B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, us
e_cached_outputs=False, mm_processor_kwargs={'min_pixels': 784, 'max_pixels': 1003520})
[W1115 12:03:19.498690632 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
INFO 11-15 12:03:19 model_runner.py:1056] Starting to load model Qwen/Qwen2-VL-7B-Instruct...
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
INFO 11-15 12:03:20 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:03, 1.24it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:01<00:02, 1.17it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:02<00:01, 1.62it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:02<00:00, 1.47it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00, 1.39it/s]
INFO 11-15 12:03:24 model_runner.py:1067] Loading model weights took 15.4517 GB
WARNING 11-15 12:03:24 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 12:03:25 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 12:03:25 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 12:03:25 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 12:03:25 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 12:03:25 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that inst
ead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that inst
ead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 11-15 12:03:27 gpu_executor.py:122] # GPU blocks: 3759, # CPU blocks: 4681
INFO 11-15 12:03:27 gpu_executor.py:126] Maximum concurrency for 4096 tokens per request: 14.68x
INFO 11-15 12:03:31 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use
'--enforce-eager' in the CLI.
INFO 11-15 12:03:31 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can a
lso reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-15 12:03:33 model_runner.py:1523] Graph capturing finished in 2 secs.
The logs say that about 15 GB is taken up by the model, which is comparable with Hugging Face. The rest is probably for the KV cache.
Also note that additional memory is allocated for calling the model. You can reduce the batch size by setting max_num_seqs.
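The numbers in the startup log can be cross-checked with a bit of arithmetic. The sketch below rests on assumptions that are not stated in the log: a KV cache block size of 16 tokens, and the Qwen2-7B language model shape (28 layers, 4 KV heads, head dimension 128) with 2-byte bfloat16 entries. The block size at least reproduces the "14.68x" concurrency figure from the log.

```python
NUM_GPU_BLOCKS = 3759   # from the log: "# GPU blocks: 3759"
MAX_MODEL_LEN = 4096    # from the log: max_seq_len=4096
BLOCK_SIZE = 16         # assumption: tokens per KV cache block

# Assumptions about the model shape (Qwen2-7B language model, bfloat16 cache).
NUM_LAYERS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_VALUE = 2


def max_concurrency(num_blocks: int, block_size: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return num_blocks * block_size / max_model_len


def kv_cache_bytes(num_blocks: int, block_size: int) -> int:
    """Total KV cache size: one key and one value vector per layer, head and token."""
    bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return num_blocks * block_size * bytes_per_token


concurrency = max_concurrency(NUM_GPU_BLOCKS, BLOCK_SIZE, MAX_MODEL_LEN)
kv_gib = kv_cache_bytes(NUM_GPU_BLOCKS, BLOCK_SIZE) / 2**30

print(f"max concurrency: {concurrency:.2f}x")
print(f"KV cache: about {kv_gib:.2f} GiB")
```

Under these assumptions the KV cache comes to roughly 3.2 GiB at gpu_memory_utilization=0.5, on top of the 15.45 GB of weights reported in the log; the remainder of the budget would cover activations.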
Okay, understood. I was planning to use vLLM, but it takes 1x more GPU memory than the raw Hugging Face code.
Your current environment
The output of `python collect_env.py`
```text
Your output of `python collect_env.py` here
```
Model Input Dumps
Thanks for the great work. I used to run multimodal models with Hugging Face. I recently heard about vLLM and ran the same model with both the Hugging Face code and the vLLM example code. I noticed that GPU usage is 1x higher in vLLM. Please see the results below for more information.
Model: Qwen2-VL-7B-Instruct
Machine: NVIDIA A100
Image & prompt: The same image and prompt were used for both experiments.
🐛 Describe the bug
Huggingface code
GPU Usage :
vLLM
Code Command run
python test.py --modality "image" --model-type qwen2_vl
Output logs:
GPU Usage :