[Bug]: Llama-3.2-11B-Vision-Instruct Inference Can't Stop

sudanl commented 1 week ago

Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

No response

🐛 Describe the bug

Directly running https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py, with max_tokens set to 1024 in the code:

python offline_inference_vision_language.py --model-type mllama

The model keeps generating until it reaches the maximum number of tokens instead of stopping early:

Loading safetensors checkpoint shards: 100% Completed | 5/5 [01:31<00:00, 18.32s/it]

INFO 10-28 09:30:16 model_runner.py:1067] Loading model weights took 19.9073 GB
INFO 10-28 09:30:16 enc_dec_model_runner.py:301] Starting profile run for multi-modal models.
INFO 10-28 09:30:21 gpu_executor.py:122] # GPU blocks: 16707, # CPU blocks: 1638
INFO 10-28 09:30:21 gpu_executor.py:126] Maximum concurrency for 4096 tokens per request: 65.26x
WARNING 10-28 09:30:25 preprocess.py:89] Falling back on <BOS> for decoder start token id because decoder start token id is not available.
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:20<00:00, 20.23s/it, est. speed input: 0.54 toks/s, output: 50.61 toks/s]
Output :   The image shows a cherry blossom tree in front of a tall tower. The tree is in full bloom, with pink flowers covering the branches. The tower is white and has a distinctive shape, with a large dome at the top and a narrow base. The sky is blue and clear, suggesting a sunny day. The overall atmosphere of the image is one of beauty and tranquility, with the vibrant colors of the flowers and the towering structure creating a sense of awe and wonder. The image may be intended to evoke feelings of peace and serenity, or to showcase the natural beauty of the cherry blossom tree and the architectural grandeur of the tower. It could also be used to promote tourism or cultural events related to the tower or the cherry blossom season. Overall, the image is a visually striking representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season. The image is a beautiful representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season. The image is a beautiful representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season. 
The image is a beautiful representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season. The image is a beautiful representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season. The image is a beautiful representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season. The image is a beautiful representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season. The image is a beautiful representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season. The image is a beautiful representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. 
The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season. The image is a beautiful representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season. The image is a beautiful representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season. The image is a beautiful representation of the intersection of nature and architecture, inviting the viewer to appreciate the beauty of both. The image is likely intended to evoke a sense of wonder and appreciation for the natural world and human ingenuity. It may also be used to promote cultural events or tourism related to the tower or the cherry blossom season...

Is that because the stop_token_ids setting is incorrect?

Before submitting a new issue...

[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

DarkLight1337 commented 1 week ago

cc @heheda12345

heheda12345 commented 1 week ago

Does the Hugging Face implementation stop with the same prompt & image?

sudanl commented 1 week ago

I ran the example script from https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct#use-with-transformers. It works well and stops as expected.

When I put the same example image & prompt into https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py, generation does not stop.
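One way to narrow this down is to compare the token ids produced by the two runs and find the first position where they differ. The snippet below is only a sketch: `first_divergence` is a hypothetical helper, the id lists are made-up placeholders, and the comparison is only meaningful if both runs use the same prompt, image, and deterministic (greedy) decoding.

```python
from itertools import zip_longest

def first_divergence(ids_a, ids_b):
    """Return the index of the first differing token id, or None if identical.

    A length mismatch counts as a divergence at the end of the shorter list.
    """
    for i, (a, b) in enumerate(zip_longest(ids_a, ids_b)):
        if a != b:
            return i
    return None

# Placeholder ids, NOT real model output.
hf_ids = [11, 42, 7, 99]         # e.g. the run that stops
vllm_ids = [11, 42, 7, 13, 13]   # e.g. the run that keeps going
print(first_divergence(hf_ids, vllm_ids))
```

If the sequences match up to the point where the Hugging Face run ends, the token it ended on is the one to check against the stop conditions on the vLLM side.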

heheda12345 commented 1 week ago

To find out whether this is caused by stop_token_ids, you can check whether the output logits computed here are correct, in particular whether the stop token ever gets a high logprob: https://github.com/vllm-project/vllm/blob/c5d7fb9ddc16d9eb68f1018cfb384faf3be301be/vllm/model_executor/models/mllama.py#L1078
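As a rough illustration of that check, the sketch below turns one step's logits into logprobs and reports the logprob and rank of each candidate stop token. It is plain Python on a toy 5-token vocabulary; `stop_token_report` and the token ids are hypothetical, and in practice the logits would come from the model at the step where you expect generation to end.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def stop_token_report(logits, stop_token_ids):
    """Map each stop token id to (logprob, rank); rank 0 is the argmax token."""
    logprobs = log_softmax(logits)
    order = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)
    rank_of = {tok: rank for rank, tok in enumerate(order)}
    return {tok: (logprobs[tok], rank_of[tok]) for tok in stop_token_ids}

# Toy logits for one decoding step; id 4 plays the role of the stop token.
toy_logits = [1.0, 3.0, 0.5, 2.0, -4.0]
report = stop_token_report(toy_logits, stop_token_ids=[4])
for tok, (logprob, rank) in report.items():
    print(f"token {tok}: logprob={logprob:.3f}, rank={rank}")
```

If the stop token ranks first at some step and generation still continues, the stop check is the likely suspect. If it never ranks high, the logits themselves differ from the Hugging Face run.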

sudanl commented 1 week ago

Hi, I just updated vllm to the latest code version, and the same problem still occurs. Could you describe in more detail how to use the output logprobs to check whether stop_token_ids is the problem?

heheda12345 commented 1 week ago

Hope these tips are helpful:

This line takes the logits as input and converts the model output into the generated token: https://github.com/vllm-project/vllm/blob/0ad216f5750742115c686723bf38698372d483fd/vllm/worker/model_runner.py#L1689
This function checks whether generation should stop based on the generated token: https://github.com/vllm-project/vllm/blob/0ad216f5750742115c686723bf38698372d483fd/vllm/engine/output_processor/stop_checker.py#L28
When comparing with the HF output, please make sure the SamplingParams are the same.
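To make the second tip concrete, here is a simplified sketch of the kind of decision a stop check makes. It is not vLLM's actual code: `ToyParams` and `finish_reason` are hypothetical names, and the token ids are placeholders. It only illustrates one possible failure mode, where the model emits an end-of-turn token that is neither the configured EOS id nor listed in stop_token_ids, so generation runs until max_tokens.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ToyParams:
    """Hypothetical stand-in for the sampling settings relevant to stopping."""
    max_tokens: int = 16
    ignore_eos: bool = False
    stop_token_ids: List[int] = field(default_factory=list)

def finish_reason(output_ids: List[int], params: ToyParams,
                  eos_token_id: int) -> Optional[str]:
    """Return "stop", "length", or None (keep generating) for the latest token."""
    if not output_ids:
        return None
    last = output_ids[-1]
    if not params.ignore_eos and last == eos_token_id:
        return "stop"
    if last in params.stop_token_ids:
        return "stop"
    if len(output_ids) >= params.max_tokens:
        return "length"
    return None

EOS_ID = 2  # placeholder id for the configured EOS token
EOT_ID = 9  # placeholder id for an end-of-turn token the model may emit

# End-of-turn token is not registered anywhere: generation continues.
print(finish_reason([5, EOT_ID], ToyParams(), EOS_ID))
# Same token listed in stop_token_ids: generation stops.
print(finish_reason([5, EOT_ID], ToyParams(stop_token_ids=[EOT_ID]), EOS_ID))
```

Whether this is what happens here would need to be confirmed by logging the generated token ids and the values the real stop check sees.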
