Closed: ZhangYaoFu closed this issue 2 months ago
cc @ywang96 @DarkLight1337
Are you inputting more than one image at a time, or using multiple processes? I'm not actually sure under what circumstances the profiler would show two separate instances of the same function being called...
I use batch inference to process multiple requests at a time, but each request contains only one image.
The problem is that, for the same image, identical image preprocessing is performed twice: once in the add_request stage and again in the execute_model stage.
test code:
```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '5'
import sys
sys.path.append("/data0/src/qwen2vl_vllm/vllm-add_qwen2_vl_new")

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info
import vllm

print(f'vllm version:{vllm.__version__}')

MODEL_PATH = "."

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.2,
    max_tokens=20,
    stop_token_ids=[],
)

processor = AutoProcessor.from_pretrained(MODEL_PATH)

directory = "./imgs"
llm_input_list = []

import time
start = time.time()

# Build one request per image file (each request contains exactly one image).
cnt = 0
for item in os.listdir(directory):
    item_path = os.path.join(directory, item)
    if os.path.isfile(item_path):
        print(f"File: {item_path}")
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": item_path,
                        "min_pixels": 840 * 420,
                        "max_pixels": 840 * 420,
                    },
                    {"type": "text", "text": "What does this diagram illustrate?"},
                ],
            },
        ]
        prompt = processor.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
        )
        image_inputs, video_inputs = process_vision_info(messages)
        mm_data = {}
        if image_inputs is not None:
            mm_data["image"] = image_inputs
        if video_inputs is not None:
            mm_data["video"] = video_inputs
        llm_inputs = {
            "prompt": prompt,
            "multi_modal_data": mm_data,
        }
        llm_input_list.append(llm_inputs)
        cnt = cnt + 1
        if cnt > 30:
            break

from torch.profiler import profile, record_function, ProfilerActivity

# Profile batched generation and export a chrome trace for inspection.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
) as prof:
    outputs = llm.generate(llm_input_list, sampling_params=sampling_params)

prof.export_chrome_trace("trace.json")

for output in outputs:
    print(f'{output.outputs[0].text}')

end = time.time()
```
I see. In this case it seems that we may add another layer to cache the images. @ywang96 do we have an existing benchmark for multi-modal models?
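For reference, here is a minimal sketch of what such a caching layer could look like, keyed by image content. ImagePreprocessCache and preprocess_fn are hypothetical names used only to illustrate the idea; they are not part of vLLM's API.

```python
# Minimal sketch of an image-preprocessing cache keyed by image content.
# ImagePreprocessCache and preprocess_fn are hypothetical names used only to
# illustrate reusing results across the add_request / execute_model stages.
import hashlib

from PIL import Image


class ImagePreprocessCache:
    def __init__(self):
        self._cache = {}

    def _key(self, image: Image.Image) -> str:
        # Hash the raw pixel bytes so identical images map to the same entry.
        return hashlib.sha256(image.tobytes()).hexdigest()

    def get_or_compute(self, image: Image.Image, preprocess_fn):
        key = self._key(image)
        if key not in self._cache:
            self._cache[key] = preprocess_fn(image)
        return self._cache[key]
```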
I took a look at the code path in the PR.
The first call of the processor processes the input sequence (I'm not sure why the image processor is also called here, and it shouldn't be called, so we need to take a look at this processor to see what's necessarily needed.)
The second call actually processes the input images, as expected.
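To illustrate the intended split (this is only a sketch, assuming a Hugging Face processor that bundles a tokenizer and an image processor, as Qwen2-VL's does): the first stage should only need the tokenizer, and the image transforms should run exactly once on the execution path. The model path and image path below are placeholders.

```python
# Sketch of the intended split between the two stages; model path and image
# path are placeholders.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

prompt = "What does this diagram illustrate?"
image = Image.open("./imgs/example.png")

# First call (prompt processing): text-only work, no image transforms needed.
token_ids = processor.tokenizer(prompt).input_ids

# Second call (model execution): the single place the image transforms run.
pixel_inputs = processor.image_processor(images=image, return_tensors="pt")
```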
We don't have an existing benchmark for multi-modal models unfortunately, but it shouldn't be hard to add one to the existing benchmark framework, since the output is always text.
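Until such a benchmark is added, a rough wall-clock measurement can be layered on the test script above. This sketch reuses llm, sampling_params, and llm_input_list from that script and only reports output-token throughput.

```python
# Rough wall-clock throughput measurement, reusing llm, sampling_params and
# llm_input_list from the test script above.
import time

start = time.perf_counter()
outputs = llm.generate(llm_input_list, sampling_params=sampling_params)
elapsed = time.perf_counter() - start

total_output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"requests: {len(outputs)}")
print(f"elapsed: {elapsed:.2f} s")
print(f"output tokens/s: {total_output_tokens / elapsed:.1f}")
```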
Thank you for your comments! I will fix this asap.
@ZhangYaoFu Hi, this commit has removed the redundant image transforms in input_processor_for_qwen2_vl, please check it again.
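One quick way to double-check the fix is to count image-processing events in the trace.json exported by the test script above. The substring matched below is an assumption about how the relevant profiler frames are named; adjust it to match your trace.

```python
# Count image-processing events in the chrome trace exported by the test
# script above. The substring "image_processor" is an assumption about how
# the relevant frames are named; adjust it to match your trace.
import json

with open("trace.json") as f:
    trace = json.load(f)

events = trace.get("traceEvents", [])
hits = [e for e in events if "image_processor" in str(e.get("name", ""))]
print(f"image-processing events: {len(hits)}")
```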
Thank you so much! I'll try it.
I've tested this feature and it's fine. Awesome!
The main issue has been solved so I'm closing this. We can revisit image caching in general if its performance becomes a significant concern.
Proposal to improve performance
Only perform image preprocessing once
Misc discussion on performance
test code: https://github.com/vllm-project/vllm/pull/7905