vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Qwen2-VL takes only 18 GB of GPU memory when run with Hugging Face code, but the same model takes 38 GB with vLLM #10357

Open Samjith888 opened 3 days ago

Samjith888 commented 3 days ago


Thanks for the great work. I used to run multimodal models with Hugging Face; recently I heard about vLLM and ran the same model with both the Hugging Face code and the vLLM example code. I noticed that vLLM uses roughly twice the GPU memory. Please check the results below for more information.

Model: Qwen2-VL-7B-Instruct
Machine: NVIDIA A100
Image & prompt: used the same image and prompt for both experiments.

🐛 Describe the bug

Hugging Face code

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

GPU Usage: [screenshot]

vLLM

Command run: `python test.py --modality "image" --model-type qwen2_vl`

Output logs:

<PIL.Image.Image image mode=RGB size=1280x720 at 0x7FB31431D1E0>                                                                                                                                                   
INFO 11-15 08:28:01 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='Qwen/Qwen2-VL-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-VL-7B-Instruct', skip_tokenizer_ini
t=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_di
r=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, 
device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute
_time=False), seed=0, served_model_name=Qwen/Qwen2-VL-7B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, us
e_cached_outputs=False, mm_processor_kwargs={'min_pixels': 784, 'max_pixels': 1003520})                                                                                                                            
INFO 11-15 08:28:02 model_runner.py:1056] Starting to load model Qwen/Qwen2-VL-7B-Instruct...                                                                                                                      
WARNING 11-15 08:28:02 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.            
WARNING 11-15 08:28:02 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.            
WARNING 11-15 08:28:02 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.            
WARNING 11-15 08:28:02 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.            
INFO 11-15 08:28:03 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:02,  1.40it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.42it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:01<00:00,  2.01it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:05<00:01,  1.97s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:06<00:00,  1.51s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:06<00:00,  1.32s/it]

INFO 11-15 08:28:10 model_runner.py:1067] Loading model weights took 15.5083 GB
WARNING 11-15 08:28:10 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
WARNING 11-15 08:28:10 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
WARNING 11-15 08:28:10 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
WARNING 11-15 08:28:10 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
WARNING 11-15 08:28:10 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
WARNING 11-15 08:28:11 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that inst
ead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that inst
ead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 11-15 08:28:12 gpu_executor.py:122] # GPU blocks: 21539, # CPU blocks: 4681
INFO 11-15 08:28:12 gpu_executor.py:126] Maximum concurrency for 4096 tokens per request: 84.14x
INFO 11-15 08:28:16 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use
 '--enforce-eager' in the CLI.
INFO 11-15 08:28:16 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can a
lso reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-15 08:28:18 model_runner.py:1523] Graph capturing finished in 2 secs.
WARNING 11-15 08:28:18 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
WARNING 11-15 08:28:21 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
WARNING 11-15 08:28:21 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
WARNING 11-15 08:28:21 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
Processed prompts:   0%|                                                                                                                 | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
WARNING 11-15 08:28:21 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
WARNING 11-15 08:28:21 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
WARNING 11-15 08:28:21 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
WARNING 11-15 08:28:21 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'max_pixels', 'min_pixels'}
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.45it/s, est. speed input: 3075.64 toks/s, output: 69.90 toks/s]
In the image, one person is driving car.

GPU Usage: [screenshot]

DarkLight1337 commented 3 days ago

By default, vLLM will take up 90% of GPU memory regardless of model size - the extra memory is used for the KV cache. If you don't want to use so much memory, you can set the --gpu-memory-utilization argument.
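To put rough numbers on that (a back-of-the-envelope sketch, assuming a 40 GB A100 and Qwen2-7B's published config of 28 layers, 4 KV heads, 128-dim heads, bf16):

# Rough split of the default 0.9 memory budget on a 40 GB A100.
gpu_total_gib = 40
budget_gib = 0.9 * gpu_total_gib            # ~36 GiB reserved by vLLM
weights_gib = 15.5                          # "Loading model weights took 15.5083 GB" above

# KV cache bytes per token: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (bf16)
kv_bytes_per_token = 2 * 28 * 4 * 128 * 2   # = 57,344 bytes, about 56 KiB per token

kv_cache_gib = budget_gib - weights_gib     # the remainder mostly becomes KV cache
max_cached_tokens = kv_cache_gib * 1024**3 / kv_bytes_per_token
print(f"~{kv_cache_gib:.1f} GiB of KV cache, room for ~{max_cached_tokens:,.0f} cached tokens")

That is in the same ballpark as the `# GPU blocks: 21539` line in your log (about 345k tokens at the default 16 tokens per block); the gap is taken up by activations, CUDA graphs and non-PyTorch memory. The point is that the extra ~20 GiB is pre-allocated cache, not memory that your single request actually needs.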

Samjith888 commented 3 days ago

There is no option to set --gpu-memory-utilization in this script.

DarkLight1337 commented 3 days ago

You can modify that script locally as you wish (just pass gpu_memory_utilization to the LLM constructor). That file is only for demonstration purposes and is not designed to accommodate every argument.

Samjith888 commented 3 days ago

def run_qwen2_vl(question: str, modality: str):
    assert modality == "image"

    model_name = "Qwen/Qwen2-VL-7B-Instruct"

    llm = LLM(
        model=model_name,
        max_model_len=4096,
        max_num_seqs=5,
        # Note - mm_processor_kwargs can also be passed to generate/chat calls
        mm_processor_kwargs={
            "min_pixels": 28 * 28,
            "max_pixels": 1280 * 28 * 28,
        },
        gpu_memory_utilization=True
    )

    prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
              "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
              f"{question}<|im_end|>\n"
              "<|im_start|>assistant\n")
    stop_token_ids = None
    return llm, prompt, stop_token_ids

I added the gpu_memory_utilization argument to the script, but now I am getting this error:

WARNING 11-15 11:25:14 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 11:25:14 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 11:25:14 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
INFO 11-15 11:25:14 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241115-112514.pkl...
WARNING 11-15 11:25:15 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't pickle local object 'weak_bind.<locals>.weak_bound'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1658, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/data/home/samjith/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/home/samjith/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_vl.py", line 1048, in forward
[rank0]:     image_embeds = self._process_image_input(image_input)
[rank0]:   File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_vl.py", line 978, in _process_image_input
[rank0]:     pixel_values = image_input["data"].type(self.visual.dtype)
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 22.38 MiB is free. Including non-PyTorch memory, this process has 39.35 GiB memo
ry in use. Of the allocated memory 38.58 GiB is allocated by PyTorch, with 1.90 MiB allocated in private pools (e.g., CUDA Graphs), and 40.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocat
ed memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-vari
ables)

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/home/samjith/test_vllm/test.py", line 541, in <module>
[rank0]:     main(args)
[rank0]:   File "/data/home/samjith/test_vllm/test.py", line 506, in main
[rank0]:     outputs = llm.generate(inputs, sampling_params=sampling_params)
[rank0]:   File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/utils.py", line 1063, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 353, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:   File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 879, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:   File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1389, in step
[rank0]:     outputs = self.model_executor.execute_model(
[rank0]:   File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 134, in execute_model
[rank0]:     output = self.driver_worker.execute_model(execute_model_req)
[rank0]:   File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
[rank0]:     output = self.model_runner.execute_model(
[rank0]:   File "/data/home/samjith/.local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper
[rank0]:     raise type(err)(f"Error in model execution: "
[rank0]: torch.OutOfMemoryError: Error in model execution: CUDA out of memory. Tried to allocate 44.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 22.38 MiB is free. Including non-PyTorch memory, this 
process has 39.35 GiB memory in use. Of the allocated memory 38.58 GiB is allocated by PyTorch, with 1.90 MiB allocated in private pools (e.g., CUDA Graphs), and 40.25 MiB is reserved by PyTorch but unallocated.
 If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/
cuda.html#environment-variables)
Processed prompts:   0%|                                                                                                                 | 0/4 [00:01<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]


DarkLight1337 commented 3 days ago

gpu_memory_utilization should be a float between 0 and 1 (default 0.9). Since PyTorch will always use a bit of GPU memory, you should set this to a value less than 1.
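For example, a minimal corrected call (0.5 is just an illustrative value):

from vllm import LLM

# gpu_memory_utilization is the fraction of total GPU memory vLLM may reserve
# (weights + KV cache + activations), not an on/off switch.
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=5,
    gpu_memory_utilization=0.5,  # reserve ~50% of the GPU instead of the 0.9 default
)

Passing True is most likely being treated as 1.0 (100% of the GPU), which would explain the OOM in your traceback above.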

Samjith888 commented 3 days ago

gpu_memory_utilization=0.1

It still takes around 33 GB; earlier it was 38 GB. [screenshot]

I'm really wondering why vLLM consumes roughly twice as much memory as the raw Hugging Face code.

DarkLight1337 commented 3 days ago

Can you copy the startup log output from vLLM (may need to set log level to debug)? It should log how much memory is taken by the model vs KV cache.

Samjith888 commented 3 days ago

gpu_memory_utilization=0.5

INFO 11-15 12:03:17 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='Qwen/Qwen2-VL-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-VL-7B-Instruct', skip_tokenizer_ini
t=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_di
r=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, 
device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute
_time=False), seed=0, served_model_name=Qwen/Qwen2-VL-7B-Instruct, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, us
e_cached_outputs=False, mm_processor_kwargs={'min_pixels': 784, 'max_pixels': 1003520})                                                                                                                            
[W1115 12:03:19.498690632 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())                                                                              
INFO 11-15 12:03:19 model_runner.py:1056] Starting to load model Qwen/Qwen2-VL-7B-Instruct...                                                                                                                      
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.            
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.            
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
WARNING 11-15 12:03:19 utils.py:513] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
INFO 11-15 12:03:20 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:03,  1.24it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.17it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:02<00:01,  1.62it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:02<00:00,  1.47it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.39it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.39it/s]

INFO 11-15 12:03:24 model_runner.py:1067] Loading model weights took 15.4517 GB
WARNING 11-15 12:03:24 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 12:03:25 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 12:03:25 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 12:03:25 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 12:03:25 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
WARNING 11-15 12:03:25 utils.py:1401] The following intended overrides are not keyword-only args and and will be dropped: {'min_pixels', 'max_pixels'}
/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that inst
ead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/data/home/samjith/anaconda3/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that inst
ead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 11-15 12:03:27 gpu_executor.py:122] # GPU blocks: 3759, # CPU blocks: 4681
INFO 11-15 12:03:27 gpu_executor.py:126] Maximum concurrency for 4096 tokens per request: 14.68x
INFO 11-15 12:03:31 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use
 '--enforce-eager' in the CLI.
INFO 11-15 12:03:31 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can a
lso reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-15 12:03:33 model_runner.py:1523] Graph capturing finished in 2 secs.

DarkLight1337 commented 3 days ago

The logs say that 15 GB is taken up by the model, which is comparable to Hugging Face. The rest is probably the KV cache.

Also note that additional memory is allocated for calling the model. You can reduce the batch size by setting max_num_seqs.
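For instance, a lower-memory configuration could look like this (illustrative values only, adjust to your workload):

from vllm import LLM

# Illustrative low-memory settings: cap the reserved fraction of GPU memory,
# shrink the KV cache via a shorter context and fewer concurrent sequences,
# and skip CUDA graph capture (the startup log notes it can add 1~3 GiB).
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    gpu_memory_utilization=0.5,
    max_model_len=2048,
    max_num_seqs=1,
    enforce_eager=True,
)

The trade-off is throughput: a smaller KV cache and batch size means fewer requests can be processed concurrently, which is exactly the headroom vLLM reserves by default.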

Samjith888 commented 3 days ago

Okay, understood. I was planning to use vLLM, but it takes roughly twice the GPU memory of the raw Hugging Face code.