huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Assert error in convert_llava_onevision_weights_to_hf.py #34467

Open Β· FuryMartin opened 1 month ago

FuryMartin commented 1 month ago

System Info

Who can help?

@zucchini-nlp

Information

Tasks

Reproduction

I copied convert_llava_onevision_weights_to_hf.py as convert.py and ran:

python convert.py --pytorch_dump_folder_path ./0.5b --model_id lmms-lab/llava-onevision-qwen2-0.5b-ov
python convert.py --pytorch_dump_folder_path ./7b --model_id lmms-lab/llava-onevision-qwen2-7b-ov

Then I encountered an error at the assertion: the logits produced by the converted model are compared against the values hard-coded in the script, and the comparison fails with RuntimeError: Half did not match Float.

lmms-lab/llava-onevision-qwen2-0.5b-ov output:

$ python convert.py --pytorch_dump_folder_path ./0.5b --model_id lmms-lab/llava-onevision-qwen2-0.5b-ov

{'_name_or_path': '/mnt/bn/vl-research/checkpoints/onevision/llavanext-google_siglip-so400m-patch14-384-Qwen_Qwen2-0.5B-Instruct-mid_to_final_next_3p2m_am9_july21', 'architectures': ['LlavaQwenForCausalLM'], 'attention_dropout': 0.0, 'mm_newline_position': 'one_token', 'bos_token_id': 151643, 'eos_token_id': 151645, 'hidden_act': 'silu', 'hidden_size': 896, 'image_aspect_ratio': 'anyres_max_9', 'image_crop_resolution': None, 'image_grid_pinpoints': [[384, 384], [384, 768], [384, 1152], [384, 1536], [384, 1920], [384, 2304], [768, 384], [768, 768], [768, 1152], [768, 1536], [768, 1920], [768, 2304], [1152, 384], [1152, 768], [1152, 1152], [1152, 1536], [1152, 1920], [1152, 2304], [1536, 384], [1536, 768], [1536, 1152], [1536, 1536], [1536, 1920], [1536, 2304], [1920, 384], [1920, 768], [1920, 1152], [1920, 1536], [1920, 1920], [1920, 2304], [2304, 384], [2304, 768], [2304, 1152], [2304, 1536], [2304, 1920], [2304, 2304]], 'image_split_resolution': None, 'image_token_index': 151646, 'initializer_range': 0.02, 'intermediate_size': 4864, 'max_position_embeddings': 32768, 'max_window_layers': 24, 'mm_hidden_size': 1152, 'mm_patch_merge_type': 'spatial_unpad', 'mm_projector_lr': None, 'mm_projector_type': 'mlp2x_gelu', 'mm_resampler_type': None, 'mm_spatial_pool_mode': 'bilinear', 'mm_tunable_parts': 'mm_vision_tower,mm_mlp_adapter,mm_language_model', 'mm_use_im_patch_token': False, 'mm_use_im_start_end': False, 'mm_vision_select_feature': 'patch', 'mm_vision_select_layer': -2, 'mm_vision_tower': 'google/siglip-so400m-patch14-384', 'mm_vision_tower_lr': 2e-06, 'model_type': 'llava', 'num_attention_heads': 14, 'num_hidden_layers': 24, 'num_key_value_heads': 2, 'pos_skipping_range': 4096, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000.0, 'sliding_window': 32768, 'tie_word_embeddings': True, 'tokenizer_model_max_length': 32768, 'tokenizer_padding_side': 'right', 'torch_dtype': 'bfloat16', 'transformers_version': '4.40.0.dev0', 'use_cache': True, 'use_mm_proj': True, 'use_pos_skipping': False, 'use_sliding_window': False, 'vision_tower_pretrained': None, 'vocab_size': 151936}
Fetching 1 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 2460.00it/s]
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
Saving model and processor for lmms-lab/llava-onevision-qwen2-0.5b-ov to ./0.5b
Single forward pass
Shape of logits: torch.Size([1, 6578, 152000])
First values of logits: tensor([[-12.0234, -14.3828, -12.7500],
        [  2.3828,   1.0283,   3.9512],
        [  3.6641,   4.7031,   9.1172]], device='cuda:0')
Traceback (most recent call last):
  File "/root/autodl-tmp/convert.py", line 388, in <module>
    convert_llava_to_hf(args.model_id, args.pytorch_dump_folder_path, args.push_to_hub)
  File "/root/autodl-tmp/convert.py", line 288, in convert_llava_to_hf
    assert torch.allclose(outputs.logits[0, :3, :3], expected_slice, atol=1e-4)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Half did not match Float

lmms-lab/llava-onevision-qwen2-7b-ov output:

$ python convert.py --pytorch_dump_folder_path ./7b --model_id lmms-lab/llava-onevision-qwen2-7b-ov

{'_name_or_path': '/mnt/bn/vl-research/checkpoints/onevision/llavanext-google_siglip-so400m-patch14-384-Qwen_Qwen2-7B-Instruct-mid_to_final_next_2p4m_am4', 'architectures': ['LlavaQwenForCausalLM'], 'mm_newline_position': 'one_token', 'attention_dropout': 0.0, 'bos_token_id': 151643, 'eos_token_id': 151645, 'hidden_act': 'silu', 'hidden_size': 3584, 'image_token_index': 151646, 'image_aspect_ratio': 'anyres_max_9', 'image_crop_resolution': None, 'image_grid_pinpoints': [[384, 384], [384, 768], [384, 1152], [384, 1536], [384, 1920], [384, 2304], [768, 384], [768, 768], [768, 1152], [768, 1536], [768, 1920], [768, 2304], [1152, 384], [1152, 768], [1152, 1152], [1152, 1536], [1152, 1920], [1152, 2304], [1536, 384], [1536, 768], [1536, 1152], [1536, 1536], [1536, 1920], [1536, 2304], [1920, 384], [1920, 768], [1920, 1152], [1920, 1536], [1920, 1920], [1920, 2304], [2304, 384], [2304, 768], [2304, 1152], [2304, 1536], [2304, 1920], [2304, 2304]], 'image_split_resolution': None, 'initializer_range': 0.02, 'intermediate_size': 18944, 'max_position_embeddings': 32768, 'max_window_layers': 28, 'mm_hidden_size': 1152, 'mm_patch_merge_type': 'spatial_unpad', 'mm_projector_lr': None, 'mm_projector_type': 'mlp2x_gelu', 'mm_resampler_type': None, 'mm_spatial_pool_mode': 'bilinear', 'mm_tunable_parts': 'mm_vision_tower,mm_mlp_adapter,mm_language_model', 'mm_use_im_patch_token': False, 'mm_use_im_start_end': False, 'mm_vision_select_feature': 'patch', 'mm_vision_select_layer': -2, 'mm_vision_tower': 'google/siglip-so400m-patch14-384', 'mm_vision_tower_lr': 2e-06, 'model_type': 'llava', 'num_attention_heads': 28, 'num_hidden_layers': 28, 'num_key_value_heads': 4, 'pos_skipping_range': 4096, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000.0, 'sliding_window': 131072, 'tie_word_embeddings': False, 'tokenizer_model_max_length': 32768, 'tokenizer_padding_side': 'right', 'torch_dtype': 'bfloat16', 'transformers_version': '4.40.0.dev0', 'use_cache': True, 'use_mm_proj': True, 'use_pos_skipping': False, 'use_sliding_window': False, 'vision_tower_pretrained': None, 'vocab_size': 152064}
Fetching 4 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:00<00:00, 298.81it/s]
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
Saving model and processor for lmms-lab/llava-onevision-qwen2-7b-ov to ./7b
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:03<00:00,  1.28it/s]
Single forward pass
Shape of logits: torch.Size([1, 6578, 152128])
First values of logits: tensor([[1.8486, 3.4219, 1.3125],
        [3.1191, 3.0195, 3.1660],
        [4.2461, 4.7227, 9.9609]], device='cuda:0')
Traceback (most recent call last):
  File "/root/autodl-tmp/convert.py", line 388, in <module>
    convert_llava_to_hf(args.model_id, args.pytorch_dump_folder_path, args.push_to_hub)
  File "/root/autodl-tmp/convert.py", line 288, in convert_llava_to_hf
    assert torch.allclose(outputs.logits[0, :3, :3], expected_slice, atol=1e-4)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Half did not match Float
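
For context, torch.allclose (at least on the torch versions I am using) refuses to compare tensors of different dtypes, and the expected_slice hard-coded in the script is a plain float32 torch.tensor while the converted model produces half-precision logits on my GPU. A minimal standalone sketch of that failure mode (the exact error message depends on the torch version); casting both sides to a common dtype avoids it:

import torch

expected_slice = torch.tensor([[1.0, 2.0], [3.0, 4.0]])  # float32, like the hard-coded values in the script
logits = expected_slice.to(torch.float16)                # half precision, like the model output on GPU

try:
    torch.allclose(logits, expected_slice, atol=1e-4)
except RuntimeError as e:
    print(e)  # dtype mismatch, e.g. "Half did not match Float" on some torch versions

# casting to a common dtype makes the comparison well-defined
print(torch.allclose(logits.float(), expected_slice, atol=1e-4))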

Expected behavior

The output logits match the values expected by the script, and the assertion passes without error.

FuryMartin commented 1 month ago

Hi Raushan @zucchini-nlp, as far as I know, this script was contributed by you.

Besides, you are a member of the llava-hf team on Hugging Face and have contributed many easy-to-use llava models with great passion.

Do you have any idea about this problem?

FuryMartin commented 1 month ago

I suspect this issue is related to the precision of the machine.

When I attempted to convert lmms-lab/llava-onevision-qwen2-0.5b-ov on another server equipped with an RTX 2080 Ti, its logits were not only different from those specified in the script but also from what I got on the RTX 4090 above.

The output of the conversion performed on the RTX 2080 Ti:

$ python convert.py --pytorch_dump_folder_path ./0.5b --model_id lmms-lab/llava-onevision-qwen2-0.5b-ov

{'_name_or_path': '/mnt/bn/vl-research/checkpoints/onevision/llavanext-google_siglip-so400m-patch14-384-Qwen_Qwen2-0.5B-Instruct-mid_to_final_next_3p2m_am9_july21', 'architectures': ['LlavaQwenForCausalLM'], 'attention_dropout': 0.0, 'mm_newline_position': 'one_token', 'bos_token_id': 151643, 'eos_token_id': 151645, 'hidden_act': 'silu', 'hidden_size': 896, 'image_aspect_ratio': 'anyres_max_9', 'image_crop_resolution': None, 'image_grid_pinpoints': [[384, 384], [384, 768], [384, 1152], [384, 1536], [384, 1920], [384, 2304], [768, 384], [768, 768], [768, 1152], [768, 1536], [768, 1920], [768, 2304], [1152, 384], [1152, 768], [1152, 1152], [1152, 1536], [1152, 1920], [1152, 2304], [1536, 384], [1536, 768], [1536, 1152], [1536, 1536], [1536, 1920], [1536, 2304], [1920, 384], [1920, 768], [1920, 1152], [1920, 1536], [1920, 1920], [1920, 2304], [2304, 384], [2304, 768], [2304, 1152], [2304, 1536], [2304, 1920], [2304, 2304]], 'image_split_resolution': None, 'image_token_index': 151646, 'initializer_range': 0.02, 'intermediate_size': 4864, 'max_position_embeddings': 32768, 'max_window_layers': 24, 'mm_hidden_size': 1152, 'mm_patch_merge_type': 'spatial_unpad', 'mm_projector_lr': None, 'mm_projector_type': 'mlp2x_gelu', 'mm_resampler_type': None, 'mm_spatial_pool_mode': 'bilinear', 'mm_tunable_parts': 'mm_vision_tower,mm_mlp_adapter,mm_language_model', 'mm_use_im_patch_token': False, 'mm_use_im_start_end': False, 'mm_vision_select_feature': 'patch', 'mm_vision_select_layer': -2, 'mm_vision_tower': 'google/siglip-so400m-patch14-384', 'mm_vision_tower_lr': 2e-06, 'model_type': 'llava', 'num_attention_heads': 14, 'num_hidden_layers': 24, 'num_key_value_heads': 2, 'pos_skipping_range': 4096, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000.0, 'sliding_window': 32768, 'tie_word_embeddings': True, 'tokenizer_model_max_length': 32768, 'tokenizer_padding_side': 'right', 'torch_dtype': 'bfloat16', 'transformers_version': '4.40.0.dev0', 'use_cache': True, 'use_mm_proj': True, 'use_pos_skipping': False, 'use_sliding_window': False, 'vision_tower_pretrained': None, 'vocab_size': 151936}
Fetching 1 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 2455.68it/s]
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
Saving model and processor for lmms-lab/llava-onevision-qwen2-0.5b-ov to ./0.5b
Single forward pass
Shape of logits: torch.Size([1, 6578, 152000])
First values of logits: tensor([[-12.0234, -14.3828, -12.7500],
        [  2.3594,   1.0078,   3.9277],
        [  3.6562,   4.7148,   9.1172]], device='cuda:0')
Traceback (most recent call last):
  File "/home/xxxx/Code/RouteMLLM/RouteMLLM/llava-critic/converter.py", line 388, in <module>
    convert_llava_to_hf(args.model_id, args.pytorch_dump_folder_path, args.push_to_hub)
  File "/home/xxxx/Code/RouteMLLM/RouteMLLM/llava-critic/converter.py", line 288, in convert_llava_to_hf
    assert torch.allclose(outputs.logits[0, :3, :3], expected_slice, atol=1e-4)
RuntimeError: Half did not match Float

The RTX 2080 Ti server specs are as follows:

FuryMartin commented 1 month ago

Could you please specify the hardware used for the conversions and explain how the standard logit values were obtained?

Thanks! 😊

zucchini-nlp commented 1 month ago

Hey @FuryMartin !

Yes, the differences can be caused by the hardware and also by the torch version. When I converted the weights I used an 80GB A100, and I'll try to run it once more tomorrow to see if the logits are similar. If you are trying to convert the weights of your own fine-tuned model, I'd suggest just making sure the logits match on your machine; then you should be good to go :)

FuryMartin commented 1 month ago

Thanks for the quick reply! @zucchini-nlp

I will also try to convert the models on an 80G A100 server and check the logits.

By the way, I am indeed trying to convert a new model named lmms-lab/LLaVA-Critic-7B, which is fine-tuned from the LLaVA-OneVision series, so I guess this script should also be able to convert it.

However, there's still one thing that isn't clear to me: how can I obtain an expected_slice like the one below?

https://github.com/huggingface/transformers/blob/fc1ae7f30f1d16c7652c28dd8d91c5d8a8ed2f15/src/transformers/models/llava_onevision/convert_llava_onevision_weights_to_hf.py#L230-L234

Given that the LLaVA repository contains many custom image preprocessing steps and other abstract inference APIs, I'm not sure how to obtain reference logits.

If you have any script for quickly obtaining them, would you mind sharing it with me? Thanks a lot!

FuryMartin commented 4 weeks ago

Oh, I have figured out how to get the logits using the official LLaVA-NeXT framework's guide. I'll put it here for anyone who might need it too:

# model, input_ids, image_tensor and image_sizes come from the
# LLaVA-NeXT (LLaVA-OneVision) tutorial setup
with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        images=image_tensor,
        image_sizes=image_sizes,
        output_hidden_states=True,
        output_attentions=False,
    )

logits = outputs.logits
print(logits)

I tried lmms-lab/llava-onevision-qwen2-0.5b-ov on my RTX 2080 Ti, and its output is the same as the converted model's:

tensor([[[-12.0234, -14.3828, -12.7500,  ...,   9.0547,   8.8906,  15.0938],
         [  2.3594,   1.0078,   3.9277,  ...,   8.4609,  14.5000,  -2.7344],
         [  3.6562,   4.7148,   9.1172,  ...,   4.8672,   2.9199,  -4.6914],
         ...,
         [  0.3706,  -2.3965,  -0.6265,  ...,  14.5703,  11.2344,   2.6934],
         [ -1.1260,  -0.2316,  -1.3408,  ...,  11.0859,  13.5156,   0.5308],
         [  4.8906,  14.5156,  10.2812,  ...,   6.6797,   9.5469,  -0.4673]]],
       device='cuda:0', dtype=torch.float16)
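
As a follow-up, here is a small sketch (assuming the logits tensor and the torch import from the snippet above) of how this could be turned into the hard-coded slice used by the conversion script:

# take the [0, :3, :3] slice of the reference logits and print it in a form
# that can be pasted into the conversion script as expected_slice
expected_slice = logits[0, :3, :3].to(torch.float32).cpu()
print(expected_slice)
# for the 0.5b run above this prints values like
# tensor([[-12.0234, -14.3828, -12.7500],
#         [  2.3594,   1.0078,   3.9277],
#         [  3.6562,   4.7148,   9.1172]])
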
zucchini-nlp commented 4 weeks ago

Awesome, thanks for sharing! Would you like the new models to be added to the llava-hf repository on the hub?

I saw the discussion you opened on the hub, so feel free to open a PR with an updated conversion script that covers the llava-critic models. After that PR, I can convert and upload the models to the official repo :)

Regarding the questions:

However, I am still unclear on how to obtain expected_text, and uncertain whether converting on my RTX 4090 server would result in a loss of precision.

Yes, as long as the logits match on your machine it is fine and should not result in a big loss of precision. For the expected text, you can use the official repo, but this time call model.generate() instead of forward, and make sure that the prompts are formatted with the chat template in the same way.

FuryMartin commented 4 weeks ago

Awesome, thanks for sharing! Would you like the new models to be added to the llava-hf repository on the hub?

I saw the discussion you opened on the hub, so feel free to open a PR with an updated conversion script that covers the llava-critic models. After that PR, I can convert and upload the models to the official repo :)

Yeah, I'd like to add LLaVA-Critic to llava-hf and am willing to contribute to this script.

I have converted LLaVA-Critic-7b successfully. However, I got stuck in the verification process.

I have passed the single forward verification, but as I pointed out on the hub, I'm still confused about how to generate the expected_text for LLaVA-Critic to verify generation.

https://github.com/huggingface/transformers/blob/fc1ae7f30f1d16c7652c28dd8d91c5d8a8ed2f15/src/transformers/models/llava_onevision/convert_llava_onevision_weights_to_hf.py#L302-L317

How did you get them for LLaVA-OneVision? As LLaVA-Critic was fine-tuned on LLaVA-OneVision, I believe they share the same process for obtaining expected_text.

zucchini-nlp commented 4 weeks ago

Cool, adding it to the official repo sounds good!

How did you get them for LLaVA-OneVision?

It should be almost the same as what you did above, but just call generate as below. I used this notebook when obtaining the generation results -> https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision_Tutorials.ipynb

with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        images=image_tensor,
        image_sizes=image_sizes,
        do_sample=False,
        max_new_tokens=20, # or any max new tokens that is also going to be in the conversion script
    )

text_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(text_outputs)

FuryMartin commented 4 weeks ago

It should be almost the same as what you did above, but just call generate as below. I used this notebook when obtaining the generation results -> https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision_Tutorials.ipynb

with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        images=image_tensor,
        image_sizes=image_sizes,
        do_sample=False,
        max_new_tokens=20, # or any max new tokens that is also going to be in the conversion script
    )

text_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(text_outputs)

Hi, I have extracted the inference code from the notebook as below. I tried to get the expected_text for lmms-lab/llava-onevision-qwen2-0.5b-ov to verify the conversion process.

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

from PIL import Image
import requests
import copy
import torch

import sys
import warnings

warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
llava_model_args = {
    "multimodal": True,
    "attn_implementation": "sdpa",
}
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, **llava_model_args)  # Add any other thing you want to pass in llava_model_args

model.eval()

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]

cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    max_new_tokens=100,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)

After running the inference code, I got the following output:

The image is a radar chart that compares the performance of different models in terms of their accuracy and reliability. The chart shows various benchmarks such as BLIP-2, InstructBLIP, Qwen-VL-Chat, PODE-Bench, MM-Bench, and SEED-Bench, along with other models like VQA (Visual Question Answering), GQA (General Knowledge Question Answering), and SQA-IMG (Squad Image Question Answering). Each model's performance is

However, it differs entirely from both the expected_text in convert_llava_onevision_weights_to_hf.py and the generated_text produced by the converted llava-onevision-qwen2-0.5b-ov.

make sure that the prompts are formatted with the chat template in the same way

I believe this problem is related to the chat template. I'm not sure how to apply the chat template to the prompts in the LLaVA inference example above. Could you please share some demo code?

zucchini-nlp commented 4 weeks ago

The difference between the output you got from LLaVA-VL and the expected_text in convert_llava_onevision_weights_to_hf.py can probably be explained by differences in hardware, in the same way that you got different logits.

So let's focus on getting the same result when converting the weights on the same machine. If I am not mistaken, the logits match 100%, so in that case the issue is probably the chat format, yes. You can print prompt_question to see how the question is formatted in the LLaVA-VL repo, and then print the result of apply_chat_template from the lines below to compare whether they match:

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
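
If it helps, a tiny comparison helper along these lines could show where they diverge (just a sketch; prompt_question is the string from your LLaVA-VL snippet and prompt the result of apply_chat_template above):

def diff_prompts(a: str, b: str) -> None:
    """Print the first position at which two prompt strings diverge."""
    if a == b:
        print("prompts match exactly")
        return
    for i in range(min(len(a), len(b))):
        if a[i] != b[i]:
            print(f"first difference at index {i}: {a[i:i+20]!r} vs {b[i:i+20]!r}")
            return
    print(f"one prompt is a prefix of the other (lengths {len(a)} vs {len(b)})")

diff_prompts(prompt_question, prompt)
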
FuryMartin commented 4 weeks ago

Ahhh still not working.

This time I tried to convert the llava-onevision-qwen2-0.5b-ov on an 80G A100 machine.

Single Forward Pass Assertion: Pass

Perhaps because I used the same hardware as you, I passed the test without any modification to expected_slice in convert_llava_onevision_weights_to_hf.py.

Generation Assertion: Fail

The generated_text is:

(for readability, I removed the prompt text and manually added a line break)

The image is a radar chart that compares the performance of different models in a specific task, likely related to natural language processing or machine learning. 
The chart is divided into several axes, each representing a different model or method. The models include BLIP-2, InstructBLIP, Qwen-VL-Chat, and LLaVA-1.5. The radar chart shows the performance scores for each model, with the scores ranging from 1 to 80. The models are evaluated

The expected_text is:

The image is a radar chart that compares the performance of different models in a specific task, likely related to natural language processing or machine learning. 
The chart is divided into different categories, each represented by a different color and labeled with the name of the model or technique used. The models are evaluated based on their performance metrics, such as BLEU-2, InstructBLIP, Qwen-VL-Chat, and LLaVA-1.5. The radar chart helps to visualize the relative

For contrast, the output text I get from LLaVA-VL's notebook is:

The image is a radar chart that compares the performance of different models in terms of their accuracy and reliability. 
The chart shows various benchmarks such as BLIP-2, InstructBLIP, Qwen-VL-Chat, PODE-Bench, MM-Bench, and SEED-Bench, along with other models like VQA (Visual Question Answering), GQA (General Knowledge Question Answering), and SQA-IMG (Squad Image Question Answering). Each model's performance is

To find the reason, I investigated further.

I saved the output logits from the LLaVA-VL code:

# LLaVA-OneVision Tutorial

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        images=image_tensor,
        image_sizes=image_sizes,
        output_hidden_states=True,
        output_attentions=False
    )

logits = outputs.logits

torch.save(outputs, "target.pt")

and loaded it in the conversion script:

#  convert_llava_onevision_weights_to_hf.py

    # verify single forward pass
    print("Single forward pass")
    with torch.inference_mode():
        inputs = inputs.to(device)
        outputs = model(**inputs)
        print("Shape of logits:", outputs.logits.shape)
        print("First values of logits:", outputs.logits[0, :3, :3])

        target = torch.load("target.pt")

        print("Target Logits", target.logits)
        print("Output Logits", outputs.logits)

        assert torch.allclose(outputs.logits[0, :3, :3], target.logits[0, :3, :3], atol=1e-4)
        print("Slice Assertion Passed")

        assert torch.allclose(outputs.logits, target.logits, atol=1e-4)
        print("Full Assertion Passed")

By running the conversion script, I got:

Single forward pass
Shape of logits: torch.Size([1, 6578, 152000])
First values of logits: tensor([[-12.0234, -14.3828, -12.7500],
        [  2.3594,   1.0000,   3.9336],
        [  3.6582,   4.7148,   9.1172]], device='cuda:0', dtype=torch.float16)
Target Logits tensor([[[-12.0234, -14.3828, -12.7500,  ...,   9.0469,   8.8906,  15.0859],
         [  2.3594,   1.0000,   3.9336,  ...,   8.4688,  14.5156,  -2.7188],
         [  3.6582,   4.7148,   9.1172,  ...,   4.8672,   2.9141,  -4.6914],
         ...,
         [  0.3838,  -2.3594,  -0.6416,  ...,  14.5859,  11.2109,   2.6992],
         [ -1.1221,  -0.2347,  -1.3438,  ...,  11.0938,  13.5469,   0.5312],
         [  4.8867,  14.5312,  10.2812,  ...,   6.6719,   9.5391,  -0.4631]]],
       device='cuda:0', dtype=torch.float16)
Output Logits tensor([[[-12.0234, -14.3828, -12.7500,  ...,   9.3047,   9.2812,   9.2969],
         [  2.3594,   1.0000,   3.9336,  ...,  -1.7090,  -1.7109,  -1.7070],
         [  3.6582,   4.7148,   9.1172,  ...,  -2.5684,  -2.5723,  -2.5801],
         ...,
         [  0.3813,  -2.3613,  -0.6250,  ...,   1.5645,   1.5615,   1.5654],
         [ -1.1162,  -0.2271,  -1.3398,  ...,   0.4922,   0.4626,   0.4805],
         [  4.8867,  14.5469,  10.2969,  ...,   0.4358,   0.4336,   0.4084]]],
       device='cuda:0', dtype=torch.float16)
Slice Assertion Passed
Traceback (most recent call last):
  File "/root/autodl-tmp/convert.py", line 406, in <module>
    convert_llava_to_hf(args.model_id, args.pytorch_dump_folder_path, args.push_to_hub)
  File "/root/autodl-tmp/convert.py", line 239, in convert_llava_to_hf
    assert torch.allclose(outputs.logits, target.logits, atol=1e-4)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (152000) must match the size of tensor b (151647) at non-singleton dimension 2

This means that although the [0, :3, :3] slices are identical, the rest of the values differ, and the two tensors do not even have the same vocabulary dimension (152000 vs 151647).

What could cause this? Could there be a problem in the conversion process?

zucchini-nlp commented 4 weeks ago

Hmm, afaik the llava implementation defaults to FA2 when loading the model, and we also need to check the dtypes used when loading. I think that could be the reason why the logits are slightly different and why the generated text starts diverging as more tokens are generated.

How big is the diff for the logits if we don't take a slice but compare the whole tensor? If it is below the tolerance level, I think we should be good even if the generated texts don't match 100%. The difference in that case is most probably in the way the model is loaded.

FuryMartin commented 4 weeks ago

afaik the llava implementation defaults to FA2 when loading the model, and we also need to check the dtypes used when loading.

Yes, but I'm actually using sdpa as the attention implementation in the LLaVA-OneVision tutorial code.

I'm not familiar with the dtype change process, so I'm afraid that I can't provide much help about this.

How big is the diff for the logits if we don't take a slice but compare the whole tensor? If it is below the tolerance level

From the results I just printed, the deviation seems quite significant. We can examine the first row of each tensor.

Target:    [-12.0234, -14.3828, -12.7500, ..., 9.0469, 8.8906, 15.0859]
Generated: [-12.0234, -14.3828, -12.7500, ..., 9.3047, 9.2812, 9.2969]

It can be seen that the differences between the last few values are quite large. I'm not sure how these differences will influence the final output.

Besides, the logits also have different lengths along the last dimension. Perhaps this is due to additional special tokens introduced during the conversion process.

Maybe we can test the converted model on some benchmarks such as MMMU to evaluate possible accuracy loss.

zucchini-nlp commented 4 weeks ago

I'm not familiar with the dtype change process, so I'm afraid that I can't provide much help about this.

It can be set when loading the model with XXXModel.from_pretrained(model_id, torch_dtype="float16"). I guess the llava repo uses bf16 but I might be mistaken.
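
For example (a minimal sketch, using the local folder from your conversion run; the class name is the one the conversion script targets):

import torch
from transformers import LlavaOnevisionForConditionalGeneration

# load the converted checkpoint in fp16 so its logits are produced in the same
# precision as the original LLaVA-VL model (which defaults to float16)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "./0.5b",  # local path from the conversion run above
    torch_dtype=torch.float16,
    device_map="cuda:0",
)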

Besides, the logits also have different lengths along the last dimension. Perhaps this is due to additional special tokens introduced during the conversion process.

Oh yeah, you're right, that also has an influence. Can you check (logit_llava - logit_converted[:, :-1, :]).abs().max(), cropping the last token from the converted model's logits? Or maybe more than one token, if we add more?

Maybe we can test the converted model on some benchmarks such as MMMU to evaluate possible accuracy loss.

IMO this is too much for simply converting the weights. What we usually do is match the logits and the generated results. I'd love to help you out with debugging, but I might be slow this week. If you have a working branch, feel free to open a PR so we can run and test it together.

In general, I am okay as long as the logits match; the generation should then also match, unless the inputs are formatted differently :)

FuryMartin commented 4 weeks ago

It can be set when loading the model with XXXModel.from_pretrained(model_id, torch_dtype="float16"). I guess the llava repo uses bf16 but I might be mistaken.

I checked the load_pretrained_model() method used in LLaVA-VL.

Unfortunately, the default torch_dtype is indeed float16, which is the same as in the conversion script:

# LLaVA-VL

def load_pretrained_model(
    model_path: Any,
    model_base: Any,
    model_name: Any,
    load_8bit: bool = False,
    load_4bit: bool = False,
    device_map: str = "auto",
    torch_dtype: str = "float16",
    attn_implementation: str = "flash_attention_2",
    customized_config: Any | None = None,
    overwrite_config: Any | None = None,
    **kwargs: Any
) -> Any:
    ...

Oh yeah, you're right, that also has an influence. Can you check (logit_llava - logit_converted[:, :-1, :]).abs().max(), cropping the last token from the converted model's logits? Or maybe more than one token, if we add more?

I tried, but the problem seems more serious: the sizes of the 3rd dimension are completely different.

ipdb> logit_llava.shape
torch.Size([1, 6578, 151647])

ipdb> logit_converted.shape
torch.Size([1, 6578, 152000])
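
For reference, here is a small sketch of how the shared part could still be compared, cropping the vocabulary dimension instead of the token dimension (logit_llava and logit_converted are the tensors from the ipdb session above):

# compare only the vocabulary entries both tensors share, since the converted
# model's last dimension is larger (152000 vs 151647)
shared = min(logit_llava.shape[-1], logit_converted.shape[-1])
diff = (logit_llava[..., :shared].float() - logit_converted[..., :shared].float()).abs()
print("max abs diff over shared vocab :", diff.max().item())
print("mean abs diff over shared vocab:", diff.mean().item())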

IMO this is too much for simply converting the weights. What we usually do is match the logits and the generated results.

In general, I am okay as long as the logits match; the generation should then also match, unless the inputs are formatted differently :)

I agree. πŸ‘We only need to match the logits.

I'd love to help you out with debugging, but I might be slow this week. If you have a working branch, feel free to open a PR so we can run and test it together.

Thanks for the help! πŸ₯° However, I need to set aside some effort for other tasks as well.

I'll try to investigate the conversion process further when I have time. Perhaps some minor mistakes are causing the logits not to match.

zucchini-nlp commented 4 weeks ago

I tried, but the problem seems more serious: the sizes of the 3rd dimension are completely different.

Weird, this should not happen because we don't resize the dimensionality of the lm-head. What can differ is the second dimension (the token length), depending on how the inputs were tokenized/formatted. Yes, in that case I agree the conversion is doing something wrong.
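
If you want to dig further, a small diagnostic sketch like the following (llava_model / hf_model / tokenizer / processor are placeholders for however you loaded the two models) could show where the extra vocabulary entries come from:

# both models expose their output projection via get_output_embeddings();
# comparing its shape with the tokenizer length shows whether the mismatch
# comes from added special tokens or from resizing/padding the embeddings
print("original  lm head  :", llava_model.get_output_embeddings().weight.shape)
print("converted lm head  :", hf_model.get_output_embeddings().weight.shape)
print("original  tokenizer:", len(tokenizer))
print("converted tokenizer:", len(processor.tokenizer))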

No worries, take your time and lmk if you need further assistance πŸ€—

2003pro commented 3 weeks ago

Hi @FuryMartin, I am also trying to convert llava-critic to the HF version. Could you share the modifications needed to implement this? BTW, I need to convert many local llava-ov models to the HF version, and I am wondering if there are any big differences.

FuryMartin commented 3 weeks ago

Hi @FuryMartin, I am also trying to convert llava-critic to the HF version. Could you share the modifications needed to implement this? BTW, I need to convert many local llava-ov models to the HF version, and I am wondering if there are any big differences.

Hi, the conversion is easy. The core code to convert a llava-ov model is here: https://github.com/huggingface/transformers/blob/86701f2b6ff2085a3cd3ad1d30bc2ff2b10fbd94/src/transformers/models/llava_onevision/convert_llava_onevision_weights_to_hf.py#L98-L196

You only need to add an extra entry where text_model_id is determined. For example, add lmms-lab/llava-critic-7b to the following block:

elif model_id in [
    "lmms-lab/llava-onevision-qwen2-7b-ov",
    "lmms-lab/llava-onevision-qwen2-7b-si",
    "lmms-lab/llava-onevision-qwen2-7b-ov-chat",
    "lmms-lab/llava-critic-7b",
]:
    text_model_id = "Qwen/Qwen2-7B-Instruct"

By running the function, you will get a converted model under the output/ folder:

convert_llava_to_hf("lmms-lab/llava-critic-7b", "output/")

However, as we discussed above, we think the current conversion still has some problems, which may result in an accuracy drop compared to the original model. Use the converted model at your own risk.
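
For anyone who wants a quick sanity check of a converted folder, here is a hedged smoke-test sketch (the processing API follows the apply_chat_template usage shown earlier in this thread; output/ and the image URL are the ones used above, and the generated text should only be treated as a rough check given the precision issues we discussed):

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_path = "output/"  # folder produced by convert_llava_to_hf above
processor = AutoProcessor.from_pretrained(model_path)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="cuda:0"
)

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

with torch.inference_mode():
    out = model.generate(**inputs, do_sample=False, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))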