Open Chuyun-Shen opened 2 weeks ago
If the only difference between this model and the vanilla LlavaForConditionalGeneration is the language backbone, you should be able to load the model in vLLM by setting the text_config in the HuggingFace config.json to load Qwen2 instead of Llama.
Thank you for your reply. Following your instructions, I modified the corresponding JSON file and deployed the model on my GPU, but I don't know how to match it with https://github.com/FreedomIntelligence/HuatuoGPT-Vision/blob/. I tested the offline batched inference example from https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference and the printed result is:
Prompt: 'Hello, my name is', Generated text: '!!!!!!!!!!!!!!!!'
Prompt: 'The president of the United States is', Generated text: '!!!!!!!!!!!!!!!!!'
Prompt: 'The capital of France is', Generated text: '!!!!!!!!!!!!!!!!!!!'
Prompt: 'The future of AI is', Generated text: '!!!!!!!!!!!!!!!!!'
Need your help!
This sounds a lot like the problem encountered by @fyabc when implementing Qwen2-VL in #7905. In this case, the problem may be inside the Qwen2 backbone itself.
By the way, please provide a bit more details about how you modified the JSON file.
I added the text_config to the config.json file:
{
"_name_or_path": "HuatuoGPT-Vision-7B",
"architectures": [
"LlavaForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 3584,
"image_aspect_ratio": "pad",
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 131072,
"max_window_layers": 28,
"mm_hidden_size": 1024,
"mm_projector_type": "mlp2x_gelu",
"mm_vision_select_feature": "patch",
"mm_vision_select_layer": -2,
"mm_vision_tower": "./vit/clip_vit_large_patch14_336",
"model_type": "llava_qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": false,
"tokenizer_model_max_length": 4096,
"tokenizer_padding_side": "right",
"torch_dtype": "bfloat16",
"transformers_version": "4.40.0.dev0",
"tune_mm_mlp_adapter": false,
"use_cache": false,
"use_mm_proj": true,
"use_sliding_window": false,
"vocab_size": 152064,
"text_config": {
"_name_or_path": "Qwen/Qwen2-7B-Instruct",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.41.2",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
}
I also added a preprocessor_config.json file based on https://huggingface.co/llava-hf/llava-1.5-7b-hf/blob/main/preprocessor_config.json:
{
"crop_size": {
"height": 336,
"width": 336
},
"do_center_crop": true,
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "CLIPImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"processor_class": "LlavaProcessor",
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"shortest_edge": 336
}
}
Some of the fields in LlavaQwen2Config actually belong to the language backbone. You may have to replace some of the fields in the original Qwen2Config with those values.
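As a rough illustration of that advice, the sketch below copies backbone-related fields from the top level of a LLaVA-style config into its text_config. The field list and the helper name sync_text_config are assumptions made for this example, not part of vLLM or Transformers; which fields actually need to match depends on the checkpoint.

```python
import json

# Assumed list of top-level fields that describe the language backbone.
BACKBONE_FIELDS = [
    "bos_token_id", "eos_token_id", "hidden_size", "intermediate_size",
    "max_position_embeddings", "num_attention_heads", "num_hidden_layers",
    "num_key_value_heads", "rms_norm_eps", "rope_theta", "sliding_window",
    "use_sliding_window", "vocab_size",
]


def sync_text_config(config: dict) -> dict:
    """Return a copy of config whose text_config mirrors the top-level backbone fields."""
    merged = dict(config)
    text_config = dict(config.get("text_config", {}))
    for field in BACKBONE_FIELDS:
        if field in config:
            text_config[field] = config[field]
    merged["text_config"] = text_config
    return merged


# Values taken from the config posted above, where the two levels disagree.
example = {
    "eos_token_id": 151643,
    "max_position_embeddings": 131072,
    "text_config": {
        "model_type": "qwen2",
        "eos_token_id": 151645,
        "max_position_embeddings": 32768,
    },
}
synced = sync_text_config(example)
print(json.dumps(synced["text_config"], indent=2))
```

To apply it to a real checkpoint, one would load config.json with json.load, pass the dictionary through the helper, and write the result back with json.dump, keeping a backup of the original file.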
Thanks for your response. I'm trying to figure out how to use both images and text as input in this case. The code at https://github.com/FreedomIntelligence/HuatuoGPT-Vision/blob/main/cli.py seems to handle this, but it uses a different approach than the vLLM API. Do you have any suggestions or examples of how to combine image paths and text when prompting the model, ideally in a way that is compatible with the vLLM framework?
For now, this is best handled by the OpenAI-compatible server, which supports multi-modal inputs directly according to the OpenAI API spec. The offline LLM.chat method currently only supports text inputs. If you wish to perform offline inference, you can try to use vllm.entrypoints.chat_utils to process the multi-modal inputs beforehand.
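For the server route, a minimal sketch of a combined text and image request is shown here, assuming the message layout of the OpenAI Chat Completions API with an image_url content part. The helper names build_image_message and ask are hypothetical, and whether the server accepts the image depends on the model loading correctly in the first place.

```python
import base64


def build_image_message(text: str, image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build one user message holding a text part and an inline base64 image part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
        ],
    }


def ask(client, model: str, message: dict) -> str:
    """Send the message through an openai.OpenAI client pointed at the vLLM server."""
    completion = client.chat.completions.create(model=model, messages=[message])
    return completion.choices[0].message.content


# Placeholder bytes stand in for a real image read with open(path, "rb").read().
message = build_image_message("What is shown in this image?", b"\xff\xd8\xff")
print(message["content"][1]["image_url"]["url"])
```

The ask function is not called here because it needs a running server; client would be an OpenAI instance created with the server's base_url and api_key.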
I tried the example in https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)
completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(completion.choices[0].message)
However, I get nothing in the output. The log shows:
INFO 09-03 11:22:28 logger.py:36] Received request chat-7b8a6a1dfefa41b89eb94b03d4bda69b: prompt: '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131052, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 151645, 198, 151644, 872, 198, 9707, 0, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
INFO 09-03 11:22:28 async_llm_engine.py:205] Added request chat-7b8a6a1dfefa41b89eb94b03d4bda69b.
Has the model finished downloading? The example is using a different model than the one you originally used.
Sorry for the misunderstanding. I have changed the script to load my locally downloaded model.
In detail, I used CUDA_VISIBLE_DEVICES=2 vllm serve /home/chuyun/hf_hub/HuatuoGPT-Vision-7B --dtype auto --api-key token-abc123 --port 7996
to load HuatuoGPT-Vision-7B.
Then I ran:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:7996/v1",
    api_key="token-abc123",
)
completion = client.chat.completions.create(
    model="/home/chuyun/hf_hub/HuatuoGPT-Vision-7B",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(completion.choices[0].message)
It keeps running without ending and doesn't print any information.
Can the model run in offline mode with just text? What does your JSON config look like now?
Offline mode returns:
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00, 174.82it/s]
INFO 09-03 12:30:06 model_runner.py:917] Loading model weights took 14.8443 GB
WARNING 09-03 12:30:06 model_runner.py:1084] Computed max_num_seqs (min(256, 512 // 576)) to be less than 1. Setting it to the minimum value of 1.
INFO 09-03 12:30:09 gpu_executor.py:121] # GPU blocks: 23818, # CPU blocks: 4681
INFO 09-03 12:30:18 model_runner.py:1208] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-03 12:30:18 model_runner.py:1212] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-03 12:30:40 model_runner.py:1327] Graph capturing finished in 22 secs.
Processed prompts: 100%|█████████████| 4/4 [00:00<00:00, 14.00it/s, est. speed input: 77.03 toks/s, output: 224.08 toks/s]
Prompt: 'Hello, my name is', Generated text: '!!!!!!!!!!!!!!!!'
Prompt: 'The president of the United States is', Generated text: '!!!!!!!!!!!!!!!!'
Prompt: 'The capital of France is', Generated text: '!!!!!!!!!!!!!!!!'
Prompt: 'The future of AI is', Generated text: '!!!!!!!!!!!!!!!!'
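The WARNING line in this log is plain integer arithmetic, worked through in the sketch below. The value 576 matches the number of patch features a CLIP ViT-L/14 encoder produces for a 336x336 image, (336 / 14)^2. Reading 512 as a per-step token budget is an assumption based on the wording of the message; the log itself does not say what the number represents.

```python
# Patch grid of the vision tower named in config.json (clip_vit_large_patch14_336).
image_size = 336
patch_size = 14
image_tokens = (image_size // patch_size) ** 2  # 24 * 24 patches

# Numbers copied from the WARNING line; the meaning of 512 is an assumption.
configured_max_num_seqs = 256
token_budget = 512

computed = min(configured_max_num_seqs, token_budget // image_tokens)
effective = max(computed, 1)  # the log says the value is raised to the minimum of 1
print(image_tokens, computed, effective)
```

Since a single image already needs more tokens than the budget in the message, the floor division yields 0, which is why the value is clamped to 1.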
The config.json is like this:
{
"_name_or_path": "HuatuoGPT-Vision-7B",
"architectures": [
"LlavaForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 3584,
"image_aspect_ratio": "pad",
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 131072,
"max_window_layers": 28,
"mm_hidden_size": 1024,
"mm_projector_type": "mlp2x_gelu",
"mm_vision_select_feature": "patch",
"mm_vision_select_layer": -2,
"mm_vision_tower": "./vit/clip_vit_large_patch14_336",
"model_type": "llava_qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": false,
"tokenizer_model_max_length": 4096,
"tokenizer_padding_side": "right",
"torch_dtype": "bfloat16",
"transformers_version": "4.40.0.dev0",
"tune_mm_mlp_adapter": false,
"use_cache": false,
"use_mm_proj": true,
"use_sliding_window": false,
"vocab_size": 152064,
"text_config": {
"_name_or_path": "Qwen/Qwen2-7B-Instruct",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151643,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 131072,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.40.0.dev0",
"use_cache": false,
"use_sliding_window": false,
"vocab_size": 152064
}
}
I also tried client.completions.create. The code is shown below:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:7996/v1",
    api_key="token-abc123",
)
completion = client.completions.create(model="/home/chuyun/hf_hub/HuatuoGPT-Vision-7B", prompt="San Francisco is a")
print(completion.choices[0])
It returns:
CompletionChoice(finish_reason='length', index=0, logprobs=None, text='!!!!!!!!!!!!!!!!', stop_reason=None, prompt_logprobs=None)
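In the server log earlier in this thread, the prompt ending in Hello! is tokenized as ..., 9707, 0, 151645, ..., so the character ! appears to be token id 0 in this tokenizer. A completion made only of ! would then be consistent with the model emitting token id 0 at every step, which may point to a weight or config problem rather than a prompting problem. The helper below is a hypothetical check for that pattern, and the second sample string is made up for contrast.

```python
def is_single_char_run(text: str, min_length: int = 8) -> bool:
    """Return True when text is at least min_length long and repeats one character."""
    return len(text) >= min_length and len(set(text)) == 1


outputs = [
    "!!!!!!!!!!!!!!!!",     # text field from the CompletionChoice above
    " city in California",  # made-up example of an ordinary completion
]
for text in outputs:
    print(repr(text), is_single_char_run(text))
```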
For the online inference using Chat Completions API, you may be missing the chat template. Nevertheless, we still have the problem of nonsense output.
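One way to check whether a chat template is being applied is to compare the prompt in the server log with what a ChatML-style template would render. The sketch below reproduces the prompt string from the log earlier in this thread. The function render_chatml is a hypothetical helper, not the tokenizer's own template, and the system message is copied from that log.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render messages in the <|im_start|>role ... <|im_end|> layout seen in the log."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"},
]
prompt = render_chatml(messages)
print(repr(prompt))
```

If the logged prompt already has this shape, as it does in the log above, a template is being applied and the degenerate output is more likely to come from elsewhere.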
Is the model designed for text-only input or is it necessary to input both text and image?
This model is designed for both image and text input, and the authors provide an offline CLI script: https://github.com/FreedomIntelligence/HuatuoGPT-Vision/blob/main/cli.py
I need to input both text and an image.
The model to consider.
I am interested in deploying the HuatuoGPT-Vision 7B model, as detailed in the HuatuoGPT-Vision repository. However, I noticed that the model architecture LlavaQwen2ForCausalLM used by HuatuoGPT-Vision-7B is not currently supported by vLLM.
Because this is a multimodal model, it is hard for me to add it to the vLLM framework myself. Could you please consider adding support for this model? If there are any specific challenges or requirements for enabling this support, I would be happy to assist or provide more information.
Thank you for your attention to this request.
The closest model vllm already supports.
LlavaQwenForCausalLM
What's your difficulty of supporting the model you want?
No response