vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: Image preprocessing is executed twice for same image during VLLM(Qwen2-vl) inference #8316

Closed: ZhangYaoFu closed this issue 2 months ago

ZhangYaoFu commented 2 months ago

Proposal to improve performance

Only perform image preprocessing once


Misc discussion on performance

[screenshot: profiler trace showing the same image-preprocessing call appearing twice]

test code: https://github.com/vllm-project/vllm/pull/7905


youkaichao commented 2 months ago

cc @ywang96 @DarkLight1337

DarkLight1337 commented 2 months ago

Are you inputting more than one image at a time, or using multiple processes? I'm not actually sure under what circumstances the profiler would show two separate instances of the same function being called...

ZhangYaoFu commented 2 months ago

> Are you inputting more than one image at a time, or using multiple processes? I'm not actually sure under what circumstances the profiler would show two separate instances of the same function being called...

I use batch inference to process multiple requests at a time, but each request contains only one image.

The problem is that the same image preprocessing is performed twice for the same image: once in the add_request stage and once again in the execute_model stage.

test code:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '5'
import sys
sys.path.append("/data0/src/qwen2vl_vllm/vllm-add_qwen2_vl_new")
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
import vllm
print(f'vllm version:{vllm.__version__}')

MODEL_PATH = "."

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.2,
    max_tokens=20,
    stop_token_ids=[],
)

processor = AutoProcessor.from_pretrained(MODEL_PATH)

directory = "./imgs"

llm_input_list = []
import time
start = time.time()

cnt = 0
for item in os.listdir(directory):
    item_path = os.path.join(directory, item)
    if os.path.isfile(item_path):
        print(f"File: {item_path}")
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": item_path,
                        "min_pixels": 840 * 420,
                        "max_pixels": 840 * 420,
                    },
                    {"type": "text", "text": "What does this diagram illustrate?"},
                ],
            },
        ]

        prompt = processor.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )
        image_inputs, video_inputs = process_vision_info(messages)

        mm_data = {}
        if image_inputs is not None:
            mm_data["image"] = image_inputs
        if video_inputs is not None:
            mm_data["video"] = video_inputs

        llm_inputs = {
            "prompt": prompt,
            "multi_modal_data": mm_data,
        }
        llm_input_list.append(llm_inputs)
        cnt = cnt + 1
        if cnt > 30:
            break

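# profile a single batched generate() call over all collected requests;
# the resulting trace is where the duplicated image preprocessing shows up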
from torch.profiler import profile, record_function, ProfilerActivity
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
) as prof:
    outputs = llm.generate(llm_input_list, sampling_params=sampling_params)
prof.export_chrome_trace("trace.json")
for output in outputs:
    print(f'{output.outputs[0].text}')

end = time.time()
print(f'elapsed: {end - start:.2f}s')
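
Two quick ways to inspect the result (minimal additions on top of the script above): print the aggregated op table for a text summary, or load the exported trace in a trace viewer to see where the duplicated preprocessing appears on the timeline.

# text summary of the captured ops, sorted by CPU time
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
# trace.json (exported above) can be opened in chrome://tracing or https://ui.perfetto.dev
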
DarkLight1337 commented 2 months ago

I see. In that case it seems we may need to add another layer to cache the images. @ywang96 do we have an existing benchmark for multi-modal models?
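
For illustration only, a minimal sketch of such a caching layer (this is not vLLM's actual interface; the names here are hypothetical), memoizing preprocessed tensors by a hash of the raw image bytes:

import hashlib

from PIL import Image
from transformers import AutoImageProcessor

MODEL_PATH = "."  # placeholder, as in the test script above

# hypothetical cache keyed by image content, not an actual vLLM interface
image_processor = AutoImageProcessor.from_pretrained(MODEL_PATH)
_preprocessed_cache = {}

def preprocess_cached(path):
    # hash the raw bytes so identical images hit the cache regardless of file name
    with open(path, "rb") as f:
        key = hashlib.sha256(f.read()).hexdigest()
    if key not in _preprocessed_cache:
        _preprocessed_cache[key] = image_processor(
            images=Image.open(path), return_tensors="pt"
        )
    return _preprocessed_cache[key]

Keying on content rather than file path would let repeated submissions of the same image skip the redundant transform entirely.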

ywang96 commented 2 months ago

I took a look at the code path in the PR.

The first call of the processor processes the input sequence. (I'm not sure why the image processor is also called here; it shouldn't be, so we need to take a look at this processor to see what is actually needed.)

The second call actually processes the input images, as expected.
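
To make the cost concrete, here is a standalone sketch (not the actual vLLM code path; the model path, image path, and prompt are placeholders) of what calling the HF processor on the raw image in both stages amounts to:

from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(".")  # placeholder model path
image = Image.open("./imgs/example.jpg")        # placeholder image
prompt = "What does this diagram illustrate?"

# first call (input processing / add_request stage): resize + normalize run here
first = processor(text=[prompt], images=[image], return_tensors="pt")

# second call (execute_model stage): the same resize + normalize run again
second = processor(text=[prompt], images=[image], return_tensors="pt")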

ywang96 commented 2 months ago

> I see. In that case it seems we may need to add another layer to cache the images. @ywang96 do we have an existing benchmark for multi-modal models?

We don't, unfortunately, but it shouldn't be hard to add one to the existing benchmark framework since the output is always text.
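
A rough sketch of what such a multi-modal throughput measurement could look like, assuming the offline llm, llm_input_list, and sampling_params already set up in the script above (this is not an existing vLLM benchmark):

import time

start = time.time()
outputs = llm.generate(llm_input_list, sampling_params=sampling_params)
elapsed = time.time() - start

# output is always text, so standard token-throughput metrics apply directly
total_output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"requests/s: {len(outputs) / elapsed:.2f}")
print(f"output tokens/s: {total_output_tokens / elapsed:.2f}")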

fyabc commented 2 months ago

> I took a look at the code path in the PR.
>
> The first call of the processor processes the input sequence. (I'm not sure why the image processor is also called here; it shouldn't be, so we need to take a look at this processor to see what is actually needed.)
>
> The second call actually processes the input images, as expected.

Thank you for your comments! I will fix this asap.

fyabc commented 2 months ago

@ZhangYaoFu Hi, this commit removes the redundant image transforms in `input_processor_for_qwen2_vl`; please check it again.

ZhangYaoFu commented 2 months ago

> @ZhangYaoFu Hi, this commit removes the redundant image transforms in `input_processor_for_qwen2_vl`; please check it again.

Thank you so much! I'll try it.

ZhangYaoFu commented 2 months ago

> @ZhangYaoFu Hi, this commit removes the redundant image transforms in `input_processor_for_qwen2_vl`; please check it again.

I've tested this feature and it's fine. Awesome

DarkLight1337 commented 2 months ago

The main issue has been solved, so I'm closing this. We can revisit image caching in general if its performance becomes a significant concern.