At this stage, you have to preprocess the image (using LLavaProcessor from HuggingFace) before feeding it into vLLM. Support for automatic image preprocessing is WIP (#4197).
Could you provide me with some sample code? I couldn't find the LLavaProcessor class in Hugging Face Transformers. Additionally, the usage you mentioned in the llava_example is also not available (https://docs.vllm.ai/en/latest/getting_started/examples/llava_example.html).
I miscapitalized the name; it should be LlavaProcessor.
The existing example was created without regard to image processing.
You can use AutoProcessor.from_pretrained(model_name) to load the processor and use it to preprocess the image. For example (not tested):
from PIL import Image
from transformers import AutoProcessor
from vllm.sequence import MultiModalData  # import path as of vLLM 0.4.x

prompt = ...  # Same as before
image = Image.open("images/stop_sign.jpg")

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
inputs_dict = processor(images=[image], return_tensors="pt")
image_input = inputs_dict["pixel_values"]

# llm is an already-constructed vllm.LLM instance for the LLaVA model
outputs = llm.generate(prompt,
                       multi_modal_data=MultiModalData(type=MultiModalData.Type.IMAGE,
                                                       data=image_input))
The code you provided has some issues; it throws an error from the processor, specifically ValueError: You need to specify either 'text' or 'text_target'. I made some modifications to the code; you can take a look. It runs without errors, but the predicted results remain the same. Additionally, I observed that the image tensor produced by the processor is identical to the one obtained from CLIPImageProcessor.from_pretrained, and also to the image tensor used when running inference with LlavaForConditionalGeneration from Hugging Face. Therefore, I don't believe the issue lies with the input. When using vLLM for inference, most of the image descriptions are correct, with only a few being incorrect. If there were issues with the image embeddings, most of the descriptions would likely be incorrect.
Make sure that you're using consistent sampling parameters (e.g. temperature, greedy decoding) for the two implementations. The simplest way is to set temperature=0 so that the result is not affected by randomness.
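For example (a minimal sketch; SamplingParams comes from vLLM, and the commented HuggingFace line assumes the model/inputs objects from your own test code):
from vllm import SamplingParams
# vLLM: temperature=0 makes decoding greedy (argmax token at every step)
sampling_params = SamplingParams(temperature=0, max_tokens=2000)
# HuggingFace equivalent: disable sampling so generation is also greedy
# output = model.generate(**inputs, max_new_tokens=2000, do_sample=False)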
The result is the same as above.
Looking into your code further, it seems that you are using different prompts. In your HuggingFace test, you placed the image before the question, but in the vLLM test, you placed the image after the question.
These two prompt formats are provided by the official documentation, not defined by me. My task is to keep the text part of them consistent.
So how should I modify it to make the two prompts consistent? Could you give me a hint?
Hugging Face prompt: "USER: <image>\n desc the image in detail \nASSISTANT:"
source website: https://huggingface.co/llava-hf/llava-1.5-7b-hf
vLLM prompt: "<image>" * 576 + (f"\n USER: desc the image in detail \nASSISTANT:")
source website: https://docs.vllm.ai/en/latest/getting_started/examples/llava_example.html
The model in vLLM is loaded from HuggingFace, so you should use the same prompt format as the one shown in HuggingFace. In this sense, the example in vLLM is inaccurate.
Directly inputting the Hugging Face prompt into vLLM won't work; it will throw an error. It seems that I must construct the prompt according to the example provided by vLLM.
Additionally, I'd like to mention that it would be best if vLLM could support direct input of input_embs, as it would require minimal changes to the source code. Moreover, it would have the broadest applicability. Creating corresponding classes for each new multimodal model is quite cumbersome, considering the diverse structures of multimodal models.
Sorry, I forgot to mention: there is one change that should be made for vLLM, which is to repeat the <image> token 576 times. But otherwise the prompt should be the same.
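For example (a minimal sketch; the 576 comes from LLaVA-1.5's CLIP vision tower, which splits a 336x336 image into 24x24 = 576 patch tokens):
hf_prompt = "USER: <image>\ndesc the image in detail \nASSISTANT:"
# expand the single placeholder into one token per image patch for vLLM
vllm_prompt = hf_prompt.replace("<image>", "<image>" * 576)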
I have transformed the prompts according to your suggestions, and the results appear to be normal.
However, I encountered another problem.
I created two prompts, prompt1 and prompt2, and tested them on three images. Below are the test results. I noticed that for the same image (image2, etc.), the descriptions generated by HF and vLLM are sometimes completely identical when using prompt1, and are different when using prompt2.
What could be the reason for this? For the same image and prompt, should the inference results from HF and vLLM be exactly the same, or is it expected that there might be slight differences?
hf_prompt = "USER: <image>
\ndesc the image in detail \nASSISTANT:"
vllm_prompt== f"USER: {'<image>'
*576}\ndesc the image in detail \nASSISTANT:"
image1 desc
hf:The image depicts a scene of an overturned wheelchair on a staircase. The chair is positioned on its side, with the wheels facing upwards and the seat and backrest in a tilted position. The staircase has a metal railing on the right side, and the steps are visible in the background. The lighting in the scene suggests an indoor environment with artificial light. There are no people visible in the image, and the focus is solely on the wheelchair and the staircase. The image does not provide any context or background story, and it is not possible to determine the circumstances that led to the wheelchair's overturned state
vllm:The image depicts a scene of an overturned wheelchair on a staircase. The chair is positioned on its side, with the wheels facing upwards and the seat and backrest in a tilted position. The staircase has a metal railing on the right side, and the steps are visible in the background. The lighting in the scene suggests an indoor environment with artificial light. There are no people visible in the image, and the focus is solely on the wheelchair and the staircase. The image does not provide any context or background story, and it is not possible to determine the circumstances that led to the wheelchair's overturned state.
image2 desc
hf:The image is a detailed depiction of a whimsical, fantasy-themed restaurant. The entrance is adorned with a large, cartoonish cat statue, which is a playful and eye-catching feature. The restaurant's interior is designed to evoke a sense of adventure and fantasy, with elements such as a castle-like structure, a mushroom, and a dragon. The use of vibrant colors and imaginative decorations creates a lively and engaging atmosphere for guests. The presence of a menu suggests that the restaurant serves food, and the overall design indicates that it is a themed dining establishment, likely catering to families and those seeking a fun and immersive dining experience
vllm:The image depicts a whimsical and colorful indoor scene that appears to be a themed restaurant or a playful dining area. The main focus is a large, cartoonish cat statue, which is the centerpiece of the room. This cat statue is dressed in a costume, complete with a ruffled collar and a star-shaped badge, adding to its playful and anthropomorphic appearance.
The room is decorated with various elements that contribute to the fantasy theme. There are mushroom-shaped decorations on the walls, which are common in fantasy and fairy tale settings. The doorway is framed by a wooden arch, and the door itself is adorned with a sign that seems to be in a foreign language, possibly Japanese, given the characters.
The lighting in the room is soft and warm, which enhances the cozy and inviting atmosphere. The overall impression is one of a fun and imaginative space, likely designed to entertain and delight guests.
image3 desc
hf:The image is a detailed aerial view of a large amusement park. The park is characterized by its colorful and playful design, with various rides and attractions spread throughout the area. The central feature of the park is a large, circular water park with a swimming pool and water slides. Surrounding this are several other attractions, including roller coasters, a Ferris wheel, and a variety of other rides. The park is well-maintained, with clear pathways and well-defined areas for different attractions. The presence of palm trees and other greenery adds to the park's vibrant and inviting atmosphere. The image does not provide any information about the park's location or the time of day, but it does offer a comprehensive view of the park's layout and the variety of attractions it offers
vllm:The image is a detailed aerial view of a large amusement park. The park is characterized by its colorful and playful design, with various rides and attractions spread throughout the area. The central feature of the park is a large, circular water park with a swimming pool and water slides. Surrounding this are several other attractions, including roller coasters, a Ferris wheel, and a variety of other rides. The park is well-maintained, with clear pathways and well-defined areas for different attractions. The presence of palm trees and other greenery adds to the park's vibrant and inviting atmosphere. The image does not provide any information about the park's location or the time of day, but it does offer a comprehensive view of the park's layout and the variety of attractions it offers
hf_prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>
\ndesc the image in detail \nASSISTANT:"
vllm_prompt = f"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: {'<image>'
*576}\ndesc the image in detail \nASSISTANT:"
image2 desc
hf:The image is a detailed depiction of a whimsical, fantasy-themed restaurant or café. The entrance is designed to resemble a castle, complete with a stone facade and a thatched roof, which is a common motif in fairy tale and fantasy settings. The sign above the door is written in a language that appears to be Chinese, suggesting that the establishment may be located in a Chinese-speaking region or cater to a Chinese clientele. The interior is not fully visible, but the doorway is adorned with a mural of a castle, which continues the fantasy theme.
The main attraction in the image is the large, plush cat statue positioned in front of the entrance. This statue is a playful and eye-catching feature that likely serves as a mascot for the establishment. The cat's design is reminiscent of a classic cartoon character, which adds to the overall playful and imaginative atmosphere of the place. The cat's position in front of the entrance suggests that it is meant to be a welcoming figure, drawing customers in and adding to the whimsical charm of the setting.
vllm:The image is a detailed depiction of a whimsical, fantasy-themed restaurant or café. The entrance is designed to resemble a castle, complete with a stone facade and a thatched roof, which is a common motif in fairy tale and fantasy settings. The sign above the door is written in a language that appears to be Chinese, suggesting that the establishment may be located in a Chinese-speaking region or cater to a Chinese clientele. The interior is not fully visible, but the doorway is adorned with a mural of a castle, which continues the fantasy theme.
The main attraction in the image is the large, plush cat statue positioned in front of the entrance. This statue is a playful and eye-catching feature that likely serves as a mascot for the establishment. The cat's design is reminiscent of a classic cartoon character, which adds to the overall playful and imaginative atmosphere of the place. The cat's position in front of the entrance suggests that it is meant to be a welcoming figure, drawing customers in and adding to the whimsical charm of the setting.
key code for prompt1
# hf
start = time.time()
prompt = "USER: <image>\ndesc the image in detail \nASSISTANT:"
url = "/media/star/8T/tmp/gpt4v/1/11.png"
image = Image.open(url).convert('RGB')
inputs = processor(prompt, image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=2000, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
print(time.time() - start)

# vllm
prompt = f"USER: {'<image>'*576}\ndesc the image in detail \nASSISTANT:"
image = Image.open("/media/star/8T/tmp/gpt4v/1/8.png").convert('RGB')
inputs = processor(prompt, image, return_tensors='pt').to(0, torch.float16)
image_tensor = inputs["pixel_values"]
# print(image_tensor.shape)  # torch.Size([1, 3, 336, 336])
sampling_params = SamplingParams(temperature=0, max_tokens=2000)
outputs = llm.generate(prompt, sampling_params,
                       multi_modal_data=MultiModalData(
                           type=MultiModalData.Type.IMAGE, data=image_tensor))
print(outputs[0].outputs[0].text)
key code for prompt2
# hf
start = time.time()
prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\ndesc the image in detail \nASSISTANT:"
url = "/media/star/8T/tmp/gpt4v/1/8.png"
image = Image.open(url).convert('RGB')
inputs = processor(prompt, image, return_tensors='pt').to(0, torch.float16)
output = model.generate(**inputs, max_new_tokens=2000, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
print(time.time() - start)

# vllm
prompt = f"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: {'<image>'*576}\ndesc the image in detail \nASSISTANT:"
image = Image.open("/media/star/8T/tmp/gpt4v/1/8.png").convert('RGB')
inputs = processor(prompt, image, return_tensors='pt').to(0, torch.float16)
image_tensor = inputs["pixel_values"]
sampling_params = SamplingParams(temperature=0, max_tokens=2000)
outputs = llm.generate(prompt, sampling_params,
                       multi_modal_data=MultiModalData(
                           type=MultiModalData.Type.IMAGE, data=image_tensor))
print(outputs[0].outputs[0].text)
In prompt2 you have an extra whitespace in the vLLM prompt after "desc the image in detail". Perhaps that affected the result.
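One quick way to catch such mismatches (a sketch, assuming the processor object and the hf_prompt / vllm_prompt variables from your message above; the repeated placeholder is collapsed first so only the surrounding text is compared):
collapsed = vllm_prompt.replace("<image>" * 576, "<image>")
print(processor.tokenizer(hf_prompt).input_ids == processor.tokenizer(collapsed).input_ids)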
The image descriptions generated by HF and vLLM for prompt2 are the same, but they differ for prompt1. Please see the results below. I removed the trailing space from prompt1. Again, for the same image and prompt, should the inference results from HF and vLLM be exactly the same, or is it expected that there might be slight differences?
I guess some difference is unavoidable due to floating-point errors, especially for long sequences. However, the output at the beginning should be the same. Perhaps those who are more familiar with the internals of vLLM can help answer this question. @rkooo567 any thoughts?
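To illustrate with a toy example (not vLLM internals): greedy decoding can flip as soon as two token logits are nearly tied, and once a single token differs, the continuations diverge.
import torch
a = torch.tensor([10.0001, 10.0000])  # logits under one kernel schedule
b = torch.tensor([10.0000, 10.0001])  # same math, slightly different rounding
print(torch.argmax(a).item(), torch.argmax(b).item())  # 0 1 -> first divergent token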
I tested 13 images using prompt1 and prompt2 with vLLM and HF separately. Here are the results:
For prompt1, there are significant differences between vLLM and HF, although a few images are exactly the same. As you mentioned, the sequences start the same but diverge later.
For prompt2, the results from vLLM and HF are nearly identical.
Considering that prompt2 follows the Vicuna prompt format used during fine-tuning, while prompt1 does not, I suspect this difference might be related to the prompt format.
As for whether the inference results from HF and vLLM should be exactly the same for the same image and prompt, I believe they should be conditionally the same, but not always identical, as seen with prompt1. The exact answer depends on the internal implementation details of vLLM and HF.
Additionally, I'd like to ask whether LLaVA in vLLM can accept text-only input instead of images. This way, I could leverage LLaVA's language modeling capabilities. I recall that the original LLaVA project also conducted fine-tuning on purely NLP datasets. If this is possible, it would allow me to avoid deploying an additional online machine just for a language modeling service.
Yes, LLaVA on vLLM can handle text-only input.
Is this the correct usage pattern?
Yes, just use it like a regular text-only LLM.
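For example (a minimal sketch, reusing the llm object and SamplingParams from the code above; the question text is just an illustration):
text_prompt = "USER: What is the capital of France?\nASSISTANT:"
outputs = llm.generate(text_prompt, SamplingParams(temperature=0, max_tokens=64))
print(outputs[0].outputs[0].text)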
Okay. Thank you for your continuous support and assistance. Best wishes to you.
Glad to help!
If you don't have further questions, please close this issue.
Your current environment
🐛 Describe the bug
I fine-tuned the llava-v1.5-7b model and saved it in the llava-1.5-7b-hf format (https://huggingface.co/llava-hf/llava-1.5-7b-hf). I compared the inference results of three different methods: the first uses the original llava project code, the second uses LlavaForConditionalGeneration from Hugging Face, and the third uses vLLM. The results are as follows. Clearly, the inference results of the first two methods are correct, while the inference result of vLLM is incorrect. Why does this happen? Hugging Face and vLLM use the same model checkpoint! Furthermore, I compared the weight values inside llava-v1.5-7b and llava-v1.5-7b-hf, and found that the values are identical.
result
1. Original llava project inference result: The image depicts a whimsical and colorful scene, likely a part of a themed restaurant or store. The main focus is a large, white cat statue with a blue bow, which appears to be a mascot or decorative figure. The cat is positioned in front of a wooden door, which is part of a larger, castle-like structure. This structure is adorned with various mushroom-shaped decorations, adding to the fantasy theme.\n\nIn the background, there are several other mushroom-shaped decorations, some of which are hanging from the ceiling. The overall atmosphere of the scene is playful and imaginative, likely designed to attract the attention of children and adults alike. The presence of the cat statue and the castle-like structure suggests that this location could be a themed restaurant or a store that sells items related to fantasy or children's entertainment
2. Hugging Face LlavaForConditionalGeneration inference result: desc the image in detail ASSISTANT: The image is a detailed depiction of a whimsical, fantasy-themed restaurant. The entrance is adorned with a large, cartoonish cat statue, which is a playful and eye-catching feature. The restaurant's interior is designed to evoke a sense of adventure and fantasy, with elements such as a castle-like structure, a mushroom, and a dragon. The use of vibrant colors and whimsical decorations creates a lively and engaging atmosphere for guests. The presence of a menu suggests that the restaurant serves food, and the overall design indicates that it is a themed dining establishment, likely catering to families and those seeking a fun and imaginative dining experience.
3. vLLM llava inference result: The image depicts a lively street scene in a European city. The street is bustling with activity, featuring a diverse group of people engaged in various activities. Some individuals are walking, while others are interacting with each other. The presence of street vendors and the variety of shops along the street contribute to the vibrant atmosphere.
The architecture of the buildings is distinctly European, with a mix of modern and traditional styles. The street is well-lit, with street lamps and natural light from the sky. The overall scene captures the essence of a busy, culturally rich urban environment.
raw image
code
Hugging Face
vLLM