At this stage, you have to preprocess the image (using LLavaProcessor from HuggingFace) before feeding it into vLLM. Support for automatic image preprocessing is WIP (#4197).
Could you provide me with some sample code? I couldn't find the LLavaProcessor class in Hugging Face Transformers. Additionally, the usage you mentioned in the llava_example is also not available (https://docs.vllm.ai/en/latest/getting_started/examples/llava_example.html).
I miscapitalized the name; it should be LlavaProcessor.
The existing example was created without regard to image processing.
You can use AutoProcessor.from_pretrained(model_name) to load the processor and use it to preprocess the image. For example (not tested):
from PIL import Image
from transformers import AutoProcessor
from vllm.sequence import MultiModalData  # import path as of vLLM 0.4.x

prompt = ...  # Same as before
image = Image.open("images/stop_sign.jpg")

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
inputs_dict = processor(images=[image], return_tensors="pt")
image_input = inputs_dict["pixel_values"]

# llm is an already-constructed vllm.LLM instance for the LLaVA model
outputs = llm.generate(prompt,
                       multi_modal_data=MultiModalData(type=MultiModalData.Type.IMAGE,
                                                       data=image_input))
The code you provided has some issues; it throws an error from the processor, specifically ValueError: You need to specify either 'text' or 'text_target'. I made some modifications to the code; you can take a look. It runs without errors, but the predicted results remain the same. Additionally, I observed that the image tensor produced by the processor is identical to the one obtained from CLIPImageProcessor.from_pretrained, and also to the image tensor used when running inference with LlavaForConditionalGeneration from Hugging Face. Therefore, I don't believe the issue lies with the input. When using vLLM for inference, most of the image descriptions are correct, with only a few being incorrect. If there were issues with the image embeddings, most of the descriptions would likely be incorrect.
Make sure that you're using consistent sampling parameters (e.g. temperature, greedy decoding) for the two implementations. The simplest way is to set temperature=0 so that the result is not affected by randomness.
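For example (a minimal sketch; SamplingParams comes from vLLM, and the commented HuggingFace line assumes the model/inputs objects from your own test code):
from vllm import SamplingParams
# vLLM: temperature=0 makes decoding greedy (argmax token at every step)
sampling_params = SamplingParams(temperature=0, max_tokens=2000)
# HuggingFace equivalent: disable sampling so generation is also greedy
# output = model.generate(**inputs, max_new_tokens=2000, do_sample=False)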
The result is the same as above.
Looking into your code further, it seems that you are using different prompts. In your HuggingFace test, you placed the image before the question, but in the vLLM test, you placed the image after the question.
These two prompt formats are provided by the official documentation, not defined by me. My task is to keep the text part of them consistent.
So how should I modify it to make the two prompts consistent? Could you give me a hint?
Hugging Face prompt: "USER: <image>\n desc the image in detail \nASSISTANT:"
source website: https://huggingface.co/llava-hf/llava-1.5-7b-hf
vLLM prompt: "<image>" * 576 + (f"\n USER: desc the image in detail \nASSISTANT:")
source website: https://docs.vllm.ai/en/latest/getting_started/examples/llava_example.html
The model in vLLM is loaded from HuggingFace, so you should use the same prompt format as the one shown in HuggingFace. In this sense, the example in vLLM is inaccurate.
Directly inputting the Hugging Face prompt into vLLM won't work; it will throw an error. It seems that I must construct the prompt according to the example provided by vLLM.
Additionally, I'd like to mention that it would be best if vLLM could support direct input of input_embs, as it would require minimal changes to the source code. Moreover, it would have the broadest applicability. Creating corresponding classes for each new multimodal model is quite cumbersome, considering the diverse structures of multimodal models.
Sorry, I forgot to mention: there is one change that should be made for vLLM, which is to repeat the <image> token 576 times. But otherwise the prompt should be the same.
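For example (a minimal sketch; the 576 comes from LLaVA-1.5's CLIP vision tower, which splits a 336x336 image into 24x24 = 576 patch tokens):
hf_prompt = "USER: <image>\ndesc the image in detail \nASSISTANT:"
# expand the single placeholder into one token per image patch for vLLM
vllm_prompt = hf_prompt.replace("<image>", "<image>" * 576)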
I have transformed the prompts according to your suggestions, and the results appear to be normal.
However, I encountered another problem.
I created two prompts, prompt1 and prompt2, and tested them on three images. Below are the test results. I noticed that for the same image (image2, etc.), the descriptions generated by HF and vLLM are sometimes completely identical when using prompt1, and are different when using prompt2.
What could be the reason for this? For the same image and prompt, should the inference results from HF and vLLM be exactly the same, or is it expected that there might be slight differences?
hf_prompt = "USER: <image>
\ndesc the image in detail \nASSISTANT:"
vllm_prompt== f"USER: {'<image>'
*576}\ndesc the image in detail \nASSISTANT:"
image1 desc
hf:The image depicts a scene of an overturned wheelchair on a staircase. The chair is positioned on its side, with the wheels facing upwards and the seat and backrest in a tilted position. The staircase has a metal railing on the right side, and the steps are visible in the background. The lighting in the scene suggests an indoor environment with artificial light. There are no people visible in the image, and the focus is solely on the wheelchair and the staircase. The image does not provide any context or background story, and it is not possible to determine the circumstances that led to the wheelchair's overturned state
vllm:The image depicts a scene of an overturned wheelchair on a staircase. The chair is positioned on its side, with the wheels facing upwards and the seat and backrest in a tilted position. The staircase has a metal railing on the right side, and the steps are visible in the background. The lighting in the scene suggests an indoor environment with artificial light. There are no people visible in the image, and the focus is solely on the wheelchair and the staircase. The image does not provide any context or background story, and it is not possible to determine the circumstances that led to the wheelchair's overturned state.
image2 desc
hf:The image is a detailed depiction of a whimsical, fantasy-themed restaurant. The entrance is adorned with a large, cartoonish cat statue, which is a playful and eye-catching feature. The restaurant's interior is designed to evoke a sense of adventure and fantasy, with elements such as a castle-like structure, a mushroom, and a dragon. The use of vibrant colors and imaginative decorations creates a lively and engaging atmosphere for guests. The presence of a menu suggests that the restaurant serves food, and the overall design indicates that it is a themed dining establishment, likely catering to families and those seeking a fun and immersive dining experience
vllm:The image depicts a whimsical and colorful indoor scene that appears to be a themed restaurant or a playful dining area. The main focus is a large, cartoonish cat statue, which is the centerpiece of the room. This cat statue is dressed in a costume, complete with a ruffled collar and a star-shaped badge, adding to its playful and anthropomorphic appearance.
The room is decorated with various elements that contribute to the fantasy theme. There are mushroom-shaped decorations on the walls, which are common in fantasy and fairy tale settings. The doorway is framed by a wooden arch, and the door itself is adorned with a sign that seems to be in a foreign language, possibly Japanese, given the characters.
The lighting in the room is soft and warm, which enhances the cozy and inviting atmosphere. The overall impression is one of a fun and imaginative space, likely designed to entertain and delight guests.
image3 desc
hf:The image is a detailed aerial view of a large amusement park. The park is characterized by its colorful and playful design, with various rides and attractions spread throughout the area. The central feature of the park is a large, circular water park with a swimming pool and water slides. Surrounding this are several other attractions, including roller coasters, a Ferris wheel, and a variety of other rides. The park is well-maintained, with clear pathways and well-defined areas for different attractions. The presence of palm trees and other greenery adds to the park's vibrant and inviting atmosphere. The image does not provide any information about the park's location or the time of day, but it does offer a comprehensive view of the park's layout and the variety of attractions it offers
vllm:The image is a detailed aerial view of a large amusement park. The park is characterized by its colorful and playful design, with various rides and attractions spread throughout the area. The central feature of the park is a large, circular water park with a swimming pool and water slides. Surrounding this are several other attractions, including roller coasters, a Ferris wheel, and a variety of other rides. The park is well-maintained, with clear pathways and well-defined areas for different attractions. The presence of palm trees and other greenery adds to the park's vibrant and inviting atmosphere. The image does not provide any information about the park's location or the time of day, but it does offer a comprehensive view of the park's layout and the variety of attractions it offers
hf_prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>
\ndesc the image in detail \nASSISTANT:"
vllm_prompt = f"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: {'<image>'
*576}\ndesc the image in detail \nASSISTANT:"
image2 desc
hf:The image is a detailed depiction of a whimsical, fantasy-themed restaurant or café. The entrance is designed to resemble a castle, complete with a stone facade and a thatched roof, which is a common motif in fairy tale and fantasy settings. The sign above the door is written in a language that appears to be Chinese, suggesting that the establishment may be located in a Chinese-speaking region or cater to a Chinese clientele. The interior is not fully visible, but the doorway is adorned with a mural of a castle, which continues the fantasy theme.
The main attraction in the image is the large, plush cat statue positioned in front of the entrance. This statue is a playful and eye-catching feature that likely serves as a mascot for the establishment. The cat's design is reminiscent of a classic cartoon character, which adds to the overall playful and imaginative atmosphere of the place. The cat's position in front of the entrance suggests that it is meant to be a welcoming figure, drawing customers in and adding to the whimsical charm of the setting.
vllm:The image is a detailed depiction of a whimsical, fantasy-themed restaurant or café. The entrance is designed to resemble a castle, complete with a stone facade and a thatched roof, which is a common motif in fairy tale and fantasy settings. The sign above the door is written in a language that appears to be Chinese, suggesting that the establishment may be located in a Chinese-speaking region or cater to a Chinese clientele. The interior is not fully visible, but the doorway is adorned with a mural of a castle, which continues the fantasy theme.
The main attraction in the image is the large, plush cat statue positioned in front of the entrance. This statue is a playful and eye-catching feature that likely serves as a mascot for the establishment. The cat's design is reminiscent of a classic cartoon character, which adds to the overall playful and imaginative atmosphere of the place. The cat's position in front of the entrance suggests that it is meant to be a welcoming figure, drawing customers in and adding to the whimsical charm of the setting.
key code for prompt1
# hf
start = time.time()
prompt = "USER: <image>\ndesc the image in detail \nASSISTANT:"
url = "/media/star/8T/tmp/gpt4v/1/11.png"
image = Image.open(url).convert('RGB')
inputs = processor(prompt, image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=2000, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
print(time.time() - start)

# vllm
prompt = f"USER: {'<image>'*576}\ndesc the image in detail \nASSISTANT:"
image = Image.open("/media/star/8T/tmp/gpt4v/1/8.png").convert('RGB')
inputs = processor(prompt, image, return_tensors='pt').to(0, torch.float16)
image_tensor = inputs["pixel_values"]
# print(image_tensor.shape)  # torch.Size([1, 3, 336, 336])
sampling_params = SamplingParams(temperature=0, max_tokens=2000)
outputs = llm.generate(prompt, sampling_params,
                       multi_modal_data=MultiModalData(
                           type=MultiModalData.Type.IMAGE, data=image_tensor))
print(outputs[0].outputs[0].text)
key code for prompt2
# hf
start = time.time()
prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\ndesc the image in detail \nASSISTANT:"
url = "/media/star/8T/tmp/gpt4v/1/8.png"
image = Image.open(url).convert('RGB')
inputs = processor(prompt, image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=2000, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
print(time.time() - start)

# vllm
prompt = f"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: {'<image>'*576}\ndesc the image in detail \nASSISTANT:"
image = Image.open("/media/star/8T/tmp/gpt4v/1/8.png").convert('RGB')
inputs = processor(prompt, image, return_tensors='pt').to(0, torch.float16)
image_tensor = inputs["pixel_values"]
sampling_params = SamplingParams(temperature=0, max_tokens=2000)
outputs = llm.generate(prompt, sampling_params,
                       multi_modal_data=MultiModalData(
                           type=MultiModalData.Type.IMAGE, data=image_tensor))
print(outputs[0].outputs[0].text)
In prompt2 you have an extra whitespace in the vLLM prompt after "desc the image in detail". Perhaps that affected the result.
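One quick way to catch such mismatches (a sketch, assuming the processor object and the hf_prompt / vllm_prompt variables from your message above; the repeated placeholder is collapsed first so only the surrounding text is compared):
collapsed = vllm_prompt.replace("<image>" * 576, "<image>")
print(processor.tokenizer(hf_prompt).input_ids == processor.tokenizer(collapsed).input_ids)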
The image descriptions generated by HF and vLLM for prompt2 are the same, but they differ for prompt1. Please see the results below. I removed the trailing space from prompt1. Again, for the same image and prompt, should the inference results from HF and vLLM be exactly the same, or is it expected that there might be slight differences?
I guess some difference is unavoidable due to floating-point errors, especially for long sequences. However, the output at the beginning should be the same. Perhaps those who are more familiar with the internals of vLLM can help answer this question. @rkooo567 any thoughts?
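To illustrate with a toy example (not vLLM internals): greedy decoding can flip as soon as two token logits are nearly tied, and once a single token differs, the continuations diverge.
import torch
a = torch.tensor([10.0001, 10.0000])  # logits under one kernel schedule
b = torch.tensor([10.0000, 10.0001])  # same math, slightly different rounding
print(torch.argmax(a).item(), torch.argmax(b).item())  # 0 1 -> first divergent token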
I tested 13 images using prompt1 and prompt2 with vLLM and HF separately. Here are the results:
For prompt1, there are significant differences between vLLM and HF, although a few images are exactly the same. As you mentioned, the sequences start the same but diverge later.
For prompt2, the results from vLLM and HF are nearly identical.
Considering that prompt2 follows the Vicuna prompt format used during fine-tuning, while prompt1 does not, I suspect this difference might be related to the prompt format.
As for whether the inference results from HF and vLLM should be exactly the same for the same image and prompt, I believe they should be conditionally the same, but not always identical, as seen with prompt1. The exact answer depends on the internal implementation details of vLLM and HF.
Additionally, I'd like to ask whether LLaVA in vLLM can accept text-only input instead of images. This way, I could leverage LLaVA's language modeling capabilities. I recall that the original LLaVA project also conducted fine-tuning on purely NLP datasets. If this is possible, it would allow me to avoid deploying an additional online machine just for a language modeling service.
Yes, LLaVA on vLLM can handle text-only input.
Is this the correct usage pattern?
Yes, just use it like a regular text-only LLM.
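For example (a minimal sketch, reusing the llm object and SamplingParams from the code above; the question text is just an illustration):
text_prompt = "USER: What is the capital of France?\nASSISTANT:"
outputs = llm.generate(text_prompt, SamplingParams(temperature=0, max_tokens=64))
print(outputs[0].outputs[0].text)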
Okay. Thank you for your continuous support and assistance. Best wishes to you.
Glad to help!
If you don't have further questions, please close this issue.
Your current environment
🐛 Describe the bug
I fine-tuned the llava-v1.5-7b model and saved it in the llava-1.5-7b-hf format (https://huggingface.co/llava-hf/llava-1.5-7b-hf). I compared the inference results of three different methods: the first uses the original llava project code, the second uses LlavaForConditionalGeneration from Hugging Face, and the third uses vLLM. The results are as follows. Clearly, the inference results of the first two methods are correct, while the inference result of vLLM is incorrect. Why does this happen? Hugging Face and vLLM use the same model checkpoint! Furthermore, I compared the weight values inside llava-v1.5-7b and llava-v1.5-7b-hf, and found that the values are identical.
result
1. Original llava project inference result: The image depicts a whimsical and colorful scene, likely a part of a themed restaurant or store. The main focus is a large, white cat statue with a blue bow, which appears to be a mascot or decorative figure. The cat is positioned in front of a wooden door, which is part of a larger, castle-like structure. This structure is adorned with various mushroom-shaped decorations, adding to the fantasy theme.\n\nIn the background, there are several other mushroom-shaped decorations, some of which are hanging from the ceiling. The overall atmosphere of the scene is playful and imaginative, likely designed to attract the attention of children and adults alike. The presence of the cat statue and the castle-like structure suggests that this location could be a themed restaurant or a store that sells items related to fantasy or children's entertainment
2. Hugging Face LlavaForConditionalGeneration inference result: desc the image in detail ASSISTANT: The image is a detailed depiction of a whimsical, fantasy-themed restaurant. The entrance is adorned with a large, cartoonish cat statue, which is a playful and eye-catching feature. The restaurant's interior is designed to evoke a sense of adventure and fantasy, with elements such as a castle-like structure, a mushroom, and a dragon. The use of vibrant colors and whimsical decorations creates a lively and engaging atmosphere for guests. The presence of a menu suggests that the restaurant serves food, and the overall design indicates that it is a themed dining establishment, likely catering to families and those seeking a fun and imaginative dining experience.
3. vLLM llava inference result: The image depicts a lively street scene in a European city. The street is bustling with activity, featuring a diverse group of people engaged in various activities. Some individuals are walking, while others are interacting with each other. The presence of street vendors and the variety of shops along the street contribute to the vibrant atmosphere.
The architecture of the buildings is distinctly European, with a mix of modern and traditional styles. The street is well-lit, with street lamps and natural light from the sky. The overall scene captures the essence of a busy, culturally rich urban environment.
raw image
code
Hugging Face
vLLM