vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[New Model]: LlavaQwen2ForCausalLM #7984

Open Chuyun-Shen opened 2 weeks ago

Chuyun-Shen commented 2 weeks ago

The model to consider.

I am interested in deploying the HuatuoGPT-Vision 7B model, as detailed in the repository HuatuoGPT-Vision. However, I noticed that the model architecture LlavaQwen2ForCausalLM used by HuatuoGPT-Vision-7B is not currently supported by vLLM.

As this is a multimodal model, it is hard for me to add it to the vLLM framework myself. Could you please consider adding support for this model? If there are any specific challenges or requirements needed to enable this support, I would be happy to assist or provide more information.

Thank you for your attention to this request.

The closest model vllm already supports.

LlavaQwenForCausalLM

What's your difficulty of supporting the model you want?

No response

DarkLight1337 commented 2 weeks ago

If the only difference between this model and the vanilla LlavaForConditionalGeneration is the language backbone, you should be able to load the model in vLLM by setting the text_config in HuggingFace config.json to load Qwen2 instead of Llama.
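
Concretely, the edit could look something like this (a rough sketch only; the path and the fields shown are placeholders, not the actual values from the HuatuoGPT checkpoint):

import json

# Placeholder path to the locally downloaded checkpoint.
cfg_path = "/path/to/HuatuoGPT-Vision-7B/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# Expose the model to vLLM as a vanilla LLaVA model...
cfg["architectures"] = ["LlavaForConditionalGeneration"]

# ...and point the language backbone at Qwen2 by nesting a Qwen2-style config.
cfg["text_config"] = {
    "model_type": "qwen2",
    "architectures": ["Qwen2ForCausalLM"],
    # copy the remaining Qwen2 hyperparameters (hidden_size, num_hidden_layers, ...) here
}

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)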

Chuyun-Shen commented 2 weeks ago

Thank you for your reply. Following your instructions, I modified the corresponding JSON file and deployed the model on my GPU, but I don't know how to match it with https://github.com/FreedomIntelligence/HuatuoGPT-Vision/blob/. I tested the offline batched inference method from https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference, and the printed result is:

Prompt: 'Hello, my name is', Generated text: '!!!!!!!!!!!!!!!!'
Prompt: 'The president of the United States is', Generated text: '!!!!!!!!!!!!!!!!!'
Prompt: 'The capital of France is', Generated text: '!!!!!!!!!!!!!!!!!!!'
Prompt: 'The future of AI is', Generated text: '!!!!!!!!!!!!!!!!!'

Need your help!

DarkLight1337 commented 2 weeks ago

This sounds a lot like the problem encountered by @fyabc when implementing Qwen2-VL in #7905. In this case, the problem may be inside the Qwen2 backbone itself.

DarkLight1337 commented 2 weeks ago

By the way, please provide a bit more detail about how you modified the JSON file.

Chuyun-Shen commented 2 weeks ago

By the way, please provide a bit more detail about how you modified the JSON file.

I added the text_config in the config.json file:

{
  "_name_or_path": "HuatuoGPT-Vision-7B",
  "architectures": [
    "LlavaForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_aspect_ratio": "pad",
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 131072,
  "max_window_layers": 28,
  "mm_hidden_size": 1024,
  "mm_projector_type": "mlp2x_gelu",
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "./vit/clip_vit_large_patch14_336",
  "model_type": "llava_qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "tokenizer_model_max_length": 4096,
  "tokenizer_padding_side": "right",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.0.dev0",
  "tune_mm_mlp_adapter": false,
  "use_cache": false,
  "use_mm_proj": true,
  "use_sliding_window": false,
  "vocab_size": 152064,
  "text_config": {
    "_name_or_path": "Qwen/Qwen2-7B-Instruct",
    "architectures": [
      "Qwen2ForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bos_token_id": 151643,
    "eos_token_id": 151645,
    "hidden_act": "silu",
    "hidden_size": 3584,
    "initializer_range": 0.02,
    "intermediate_size": 18944,
    "max_position_embeddings": 32768,
    "max_window_layers": 28,
    "model_type": "qwen2",
    "num_attention_heads": 28,
    "num_hidden_layers": 28,
    "num_key_value_heads": 4,
    "rms_norm_eps": 1e-06,
    "rope_theta": 1000000.0,
    "sliding_window": 131072,
    "tie_word_embeddings": false,
    "torch_dtype": "bfloat16",
    "transformers_version": "4.41.2",
    "use_cache": true,
    "use_sliding_window": false,
    "vocab_size": 152064
  }
}

and added a preprocessor_config.json file, referring to https://huggingface.co/llava-hf/llava-1.5-7b-hf/blob/main/preprocessor_config.json:

{
    "crop_size": {
      "height": 336,
      "width": 336
    },
    "do_center_crop": true,
    "do_convert_rgb": true,
    "do_normalize": true,
    "do_rescale": true,
    "do_resize": true,
    "image_mean": [
      0.48145466,
      0.4578275,
      0.40821073
    ],
    "image_processor_type": "CLIPImageProcessor",
    "image_std": [
      0.26862954,
      0.26130258,
      0.27577711
    ],
    "processor_class": "LlavaProcessor",
    "resample": 3,
    "rescale_factor": 0.00392156862745098,
    "size": {
      "shortest_edge": 336
    }
}

DarkLight1337 commented 2 weeks ago

Some of the fields in LlavaQwen2Config actually belong to the language backbone. You may have to replace some of the fields in the original Qwen2Config with those values.
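
For example, something along these lines (a sketch only; which fields actually need to be moved is an assumption that should be checked against the original LlavaQwen2Config):

import json

cfg_path = "/path/to/HuatuoGPT-Vision-7B/config.json"  # placeholder path

with open(cfg_path) as f:
    cfg = json.load(f)

# Candidate fields of the original LlavaQwen2Config that describe the language
# backbone and may differ from the vanilla Qwen2-7B-Instruct values.
backbone_fields = [
    "bos_token_id",
    "eos_token_id",
    "max_position_embeddings",
    "rope_theta",
    "sliding_window",
    "use_sliding_window",
    "vocab_size",
]

# Overwrite the nested text_config with the values from the top-level config.
for key in backbone_fields:
    if key in cfg:
        cfg["text_config"][key] = cfg[key]

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)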

Chuyun-Shen commented 2 weeks ago

Thanks for your response. I'm trying to figure out how to use both images and text as input in this case. The code at https://github.com/FreedomIntelligence/HuatuoGPT-Vision/blob/main/cli.py seems to handle this, but it uses a different approach than the vLLM API. Do you have any suggestions or examples of how to combine image paths and text when prompting the LLM, ideally in a way that is compatible with the vLLM framework?

DarkLight1337 commented 2 weeks ago

For now, this is best handled by the OpenAI-compatible server, which supports multi-modal inputs directly according to the OpenAI API spec. The offline LLM.chat method currently only supports text inputs. If you wish to perform offline inference, you can try to use vllm.entrypoints.chat_utils to process the multi-modal inputs beforehand.
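
For example, an image can be sent through the Chat Completions API roughly as follows (a sketch; the server port, model path and image file are placeholders):

import base64

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

# Encode a local image as a base64 data URL, as allowed by the OpenAI API spec.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="/path/to/HuatuoGPT-Vision-7B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this image show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)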

Chuyun-Shen commented 2 weeks ago

I tried the example from https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html:

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
  model="NousResearch/Meta-Llama-3-8B-Instruct",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message)

However, I get nothing in the output. The log shows:

INFO 09-03 11:22:28 logger.py:36] Received request chat-7b8a6a1dfefa41b89eb94b03d4bda69b: prompt: '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131052, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 151645, 198, 151644, 872, 198, 9707, 0, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
INFO 09-03 11:22:28 async_llm_engine.py:205] Added request chat-7b8a6a1dfefa41b89eb94b03d4bda69b.
DarkLight1337 commented 2 weeks ago

Has the model finished downloading? The example is using a different model than the one you originally used.

Chuyun-Shen commented 2 weeks ago

Sorry for the misunderstanding; I changed the script to load my locally downloaded model.

In detail, I used CUDA_VISIBLE_DEVICES=2 vllm serve /home/chuyun/hf_hub/HuatuoGPT-Vision-7B --dtype auto --api-key token-abc123 --port 7996 to load HuatuoGPT-Vision-7B.

Then I ran:

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:7996/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
  model="/home/chuyun/hf_hub/HuatuoGPT-Vision-7B",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message)

It keeps running without ending and doesn't print any information.

DarkLight1337 commented 2 weeks ago

Can the model run in offline mode with just text? What does your JSON config look like now?

Chuyun-Shen commented 2 weeks ago

Offline mode returns:

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:00<00:00, 174.82it/s]

INFO 09-03 12:30:06 model_runner.py:917] Loading model weights took 14.8443 GB
WARNING 09-03 12:30:06 model_runner.py:1084] Computed max_num_seqs (min(256, 512 // 576)) to be less than 1. Setting it to the minimum value of 1.
INFO 09-03 12:30:09 gpu_executor.py:121] # GPU blocks: 23818, # CPU blocks: 4681
INFO 09-03 12:30:18 model_runner.py:1208] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 09-03 12:30:18 model_runner.py:1212] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 09-03 12:30:40 model_runner.py:1327] Graph capturing finished in 22 secs.
Processed prompts: 100%|█████████████| 4/4 [00:00<00:00, 14.00it/s, est. speed input: 77.03 toks/s, output: 224.08 toks/s]
Prompt: 'Hello, my name is', Generated text: '!!!!!!!!!!!!!!!!'
Prompt: 'The president of the United States is', Generated text: '!!!!!!!!!!!!!!!!'
Prompt: 'The capital of France is', Generated text: '!!!!!!!!!!!!!!!!'
Prompt: 'The future of AI is', Generated text: '!!!!!!!!!!!!!!!!'

The config.json now looks like this:

{
  "_name_or_path": "HuatuoGPT-Vision-7B",
  "architectures": [
    "LlavaForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_aspect_ratio": "pad",
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 131072,
  "max_window_layers": 28,
  "mm_hidden_size": 1024,
  "mm_projector_type": "mlp2x_gelu",
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "./vit/clip_vit_large_patch14_336",
  "model_type": "llava_qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "tokenizer_model_max_length": 4096,
  "tokenizer_padding_side": "right",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.0.dev0",
  "tune_mm_mlp_adapter": false,
  "use_cache": false,
  "use_mm_proj": true,
  "use_sliding_window": false,
  "vocab_size": 152064,
  "text_config": {
    "_name_or_path": "Qwen/Qwen2-7B-Instruct",
    "architectures": [
      "Qwen2ForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bos_token_id": 151643,
    "eos_token_id": 151643,
    "hidden_act": "silu",
    "hidden_size": 3584,
    "initializer_range": 0.02,
    "intermediate_size": 18944,
    "max_position_embeddings": 131072,
    "max_window_layers": 28,
    "model_type": "qwen2",
    "num_attention_heads": 28,
    "num_hidden_layers": 28,
    "num_key_value_heads": 4,
    "rms_norm_eps": 1e-06,
    "rope_theta": 1000000.0,
    "sliding_window": 131072,
    "tie_word_embeddings": false,
    "torch_dtype": "bfloat16",
    "transformers_version": "4.40.0.dev0",
    "use_cache": false,
    "use_sliding_window": false,
    "vocab_size": 152064
  }
}

Chuyun-Shen commented 2 weeks ago

I tried client.completions.create. The code is shown below:

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:7996/v1",
    api_key="token-abc123",
)

completion = client.completions.create(model="/home/chuyun/hf_hub/HuatuoGPT-Vision-7B", prompt="San Francisco is a")
print(completion.choices[0])

It returns: CompletionChoice(finish_reason='length', index=0, logprobs=None, text='!!!!!!!!!!!!!!!!', stop_reason=None, prompt_logprobs=None)

DarkLight1337 commented 2 weeks ago

For online inference using the Chat Completions API, you may be missing the chat template. Nevertheless, we still have the problem of nonsense output.
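
If the tokenizer does not ship a chat template, you can supply one when starting the server by adding --chat-template to the vllm serve command, for example (the template file below is a placeholder):

vllm serve /home/chuyun/hf_hub/HuatuoGPT-Vision-7B --dtype auto --api-key token-abc123 --port 7996 --chat-template ./qwen2_chat_template.jinja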

DarkLight1337 commented 2 weeks ago

Is the model designed for text-only input or is it necessary to input both text and image?

Chuyun-Shen commented 2 weeks ago

Is the model designed for text-only input or is it necessary to input both text and image?

This model is designed for both image and text input, and they provide offline-mode CLI code: https://github.com/FreedomIntelligence/HuatuoGPT-Vision/blob/main/cli.py. I need to input both text and image.