Open FuryMartin opened 1 month ago
Hi Raushan @zucchini-nlp, as far as I know, this script was contributed by you.
Besides, you are a member of the llava-hf team on HuggingFace and have contributed a lot of easy-to-use llava models with great passion.
Do you have any idea about this problem?
I suspect this issue is related to the precision of the machine.
When I attempted to convert the model lmms-lab/llava-onevision-qwen2-0.5b-ov on another server equipped with an RTX 2080 Ti, its logits were not only different from those specified in the script but also differed from what I got on the RTX 4090 above.
The output of the conversion performed on the RTX 2080 Ti:
$ python convert.py --pytorch_dump_folder_path ./0.5b --model_id lmms-lab/llava-onevision-qwen2-0.5b-ov
{'_name_or_path': '/mnt/bn/vl-research/checkpoints/onevision/llavanext-google_siglip-so400m-patch14-384-Qwen_Qwen2-0.5B-Instruct-mid_to_final_next_3p2m_am9_july21', 'architectures': ['LlavaQwenForCausalLM'], 'attention_dropout': 0.0, 'mm_newline_position': 'one_token', 'bos_token_id': 151643, 'eos_token_id': 151645, 'hidden_act': 'silu', 'hidden_size': 896, 'image_aspect_ratio': 'anyres_max_9', 'image_crop_resolution': None, 'image_grid_pinpoints': [[384, 384], [384, 768], [384, 1152], [384, 1536], [384, 1920], [384, 2304], [768, 384], [768, 768], [768, 1152], [768, 1536], [768, 1920], [768, 2304], [1152, 384], [1152, 768], [1152, 1152], [1152, 1536], [1152, 1920], [1152, 2304], [1536, 384], [1536, 768], [1536, 1152], [1536, 1536], [1536, 1920], [1536, 2304], [1920, 384], [1920, 768], [1920, 1152], [1920, 1536], [1920, 1920], [1920, 2304], [2304, 384], [2304, 768], [2304, 1152], [2304, 1536], [2304, 1920], [2304, 2304]], 'image_split_resolution': None, 'image_token_index': 151646, 'initializer_range': 0.02, 'intermediate_size': 4864, 'max_position_embeddings': 32768, 'max_window_layers': 24, 'mm_hidden_size': 1152, 'mm_patch_merge_type': 'spatial_unpad', 'mm_projector_lr': None, 'mm_projector_type': 'mlp2x_gelu', 'mm_resampler_type': None, 'mm_spatial_pool_mode': 'bilinear', 'mm_tunable_parts': 'mm_vision_tower,mm_mlp_adapter,mm_language_model', 'mm_use_im_patch_token': False, 'mm_use_im_start_end': False, 'mm_vision_select_feature': 'patch', 'mm_vision_select_layer': -2, 'mm_vision_tower': 'google/siglip-so400m-patch14-384', 'mm_vision_tower_lr': 2e-06, 'model_type': 'llava', 'num_attention_heads': 14, 'num_hidden_layers': 24, 'num_key_value_heads': 2, 'pos_skipping_range': 4096, 'rms_norm_eps': 1e-06, 'rope_scaling': None, 'rope_theta': 1000000.0, 'sliding_window': 32768, 'tie_word_embeddings': True, 'tokenizer_model_max_length': 32768, 'tokenizer_padding_side': 'right', 'torch_dtype': 'bfloat16', 'transformers_version': '4.40.0.dev0', 'use_cache': True, 'use_mm_proj': True, 'use_pos_skipping': False, 'use_sliding_window': False, 'vision_tower_pretrained': None, 'vocab_size': 151936}
Fetching 1 files: 100%|████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2455.68it/s]
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
The new lm_head weights will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
Saving model and processor for lmms-lab/llava-onevision-qwen2-0.5b-ov to ./0.5b
Single forward pass
Shape of logits: torch.Size([1, 6578, 152000])
First values of logits: tensor([[-12.0234, -14.3828, -12.7500],
[ 2.3594, 1.0078, 3.9277],
[ 3.6562, 4.7148, 9.1172]], device='cuda:0')
Traceback (most recent call last):
File "/home/xxxx/Code/RouteMLLM/RouteMLLM/llava-critic/converter.py", line 388, in <module>
convert_llava_to_hf(args.model_id, args.pytorch_dump_folder_path, args.push_to_hub)
File "/home/xxxx/Code/RouteMLLM/RouteMLLM/llava-critic/converter.py", line 288, in convert_llava_to_hf
assert torch.allclose(outputs.logits[0, :3, :3], expected_slice, atol=1e-4)
RuntimeError: Half did not match Float
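(As a side note, the RuntimeError at the end seems to be just a dtype mismatch between the float16 logits and the float32 expected_slice hard-coded in the script; a minimal workaround sketch, assuming that is the only cause of the exception, is to cast both sides before comparing:)
# minimal sketch: cast both tensors to float32 so torch.allclose can compare them;
# this only removes the dtype error, it does not explain the differing values
assert torch.allclose(outputs.logits[0, :3, :3].float(), expected_slice.float(), atol=1e-4)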
The RTX 2080 Ti server specs are as follows:
Could you please specify the hardware used for the conversions and explain how the standard logit values were obtained?
Thanks!
Hey @FuryMartin !
Yes, the differences can be caused by the hardware and also by the torch version. When I converted the weights I used an 80GB A100, and I'll try to run it once more tomorrow to see if the logits are similar. If you are trying to convert weights of your own fine-tuned model, I'd suggest just making sure the logits match on your machine; then you should be good to go :)
Thanks for the quick reply! @zucchini-nlp
I will also try to convert the models on an 80G A100 server and check the logits.
By the way, I am indeed trying to convert a new model named lmms-lab/LLaVA-Critic-7B, which is fine-tuned from the LLaVA-OneVision series, so I guess this script should also be able to convert it.
However, there's still one thing that isn't clear to me: how can I obtain an expected_slice like the one below?
Given that the LLaVA repository contains many custom image preprocessing steps and other abstract inference APIs, I'm not sure how to obtain reference logits.
If you have any script for quickly obtaining them, would you mind sharing it with me? Thanks a lot!
Oh, I have figured out how to get the logits using the official llava-next framework's guide. I'm putting it here for anyone who might need it too:
import torch

# model, input_ids, image_tensor and image_sizes are prepared as in the
# LLaVA-OneVision tutorial notebook
with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        images=image_tensor,
        image_sizes=image_sizes,
        output_hidden_states=True,
        output_attentions=False,
    )
logits = outputs.logits
print(logits)
I tried lmms-lab/llava-onevision-qwen2-0.5b-ov on my RTX 2080 Ti, and its output is the same as the converted model's.
tensor([[[-12.0234, -14.3828, -12.7500, ..., 9.0547, 8.8906, 15.0938],
[ 2.3594, 1.0078, 3.9277, ..., 8.4609, 14.5000, -2.7344],
[ 3.6562, 4.7148, 9.1172, ..., 4.8672, 2.9199, -4.6914],
...,
[ 0.3706, -2.3965, -0.6265, ..., 14.5703, 11.2344, 2.6934],
[ -1.1260, -0.2316, -1.3408, ..., 11.0859, 13.5156, 0.5308],
[ 4.8906, 14.5156, 10.2812, ..., 6.6797, 9.5469, -0.4673]]],
device='cuda:0', dtype=torch.float16)
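(For reference, the same snippet can also print just the 3x3 slice that the conversion script asserts on, which I assume is how the hard-coded expected_slice values were originally produced:)
# sketch: extract only the [0, :3, :3] slice compared by the conversion script
expected_slice = logits[0, :3, :3].float().cpu()
print(expected_slice)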
Awesome, thanks for sharing! Would you like the new models to be added to the llava-hf repository on the hub?
I saw the discussion you opened on the hub, so feel free to open a PR with an updated conversion script that covers the llava-critic models. After that PR I can convert and upload the models to the official repo :)
Regarding the questions:
However, I am still unclear on how to obtain expected_text, and uncertain whether converting on my RTX 4090 server would result in a loss of precision.
Yes, as long as the logits match on your machine it is fine and should not result in a big loss of precision. For the expected text, you can use the official repo, but this time call model.generate() instead of forward, and make sure that the prompts are formatted with the chat template in the same way.
Yeah, I'd like to add LLaVA-Critic to llava-hf and I'm willing to contribute to this script.
I have converted LLaVA-Critic-7B successfully. However, I got stuck in the verification process.
I have passed the single forward verification, but as I pointed out on the hub, I'm still confused about how to generate the expected_text for LLaVA-Critic to verify generation.
How do you get it for LLaVA-OneVision? As LLaVA-Critic was fine-tuned from LLaVA-OneVision, I believe they share the same process to get expected_text.
Cool, adding it to the official repo sounds good!
How do you get it for LLaVA-OneVision?
It should be almost the same as you did above, but just call generate as below. I used this notebook when obtaining generation results -> https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision_Tutorials.ipynb
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        images=image_tensor,
        image_sizes=image_sizes,
        do_sample=False,
        max_new_tokens=20,  # or any max_new_tokens value that is also used in the conversion script
    )
text_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(text_outputs)
Hi, I have extracted the inference code from the notebook, as shown below. I tried to get the expected_text for lmms-lab/llava-onevision-qwen2-0.5b-ov to verify the conversion process.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
llava_model_args = {
    "multimodal": True,
    "attn_implementation": "sdpa",
}
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, **llava_model_args) # Add any other thing you want to pass in llava_model_args
model.eval()
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]
conv_template = "qwen_1_5" # Make sure you use correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]
cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    max_new_tokens=100,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
After running the inference code, I got the following output:
The image is a radar chart that compares the performance of different models in terms of their accuracy and reliability. The chart shows various benchmarks such as BLIP-2, InstructBLIP, Qwen-VL-Chat, PODE-Bench, MM-Bench, and SEED-Bench, along with other models like VQA (Visual Question Answering), GQA (General Knowledge Question Answering), and SQA-IMG (Squad Image Question Answering). Each model's performance is
However, it differs entirely from both the expected_text in convert_llava_onevision_weights_to_hf.py and the generated_text produced by the converted llava-onevision-qwen2-0.5b-ov.
make sure that the prompts are formatted with the chat template in the same way
I believe this problem is related to the chat template. I'm not sure how to format the prompts with the chat template in llava's inference example above. Could you please give me some demo code?
The difference between the one you got from LLaVA-VL and the one in expected_text in convert_llava_onevision_weights_to_hf.py can probably be explained by differences in hardware, the same way you got different logits.
So let's focus on getting the same result when converting the weights on the same machine. If I am not mistaken, the logits match 100%, so in that case the issue is probably the chat format, yes. You can print prompt_question to see how the question is formatted in the LLaVA-VL repo and then print the result of apply_chat_template from the line below to compare whether they match:
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
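Something like this should be enough to spot formatting differences, assuming prompt_question is the string from your LLaVA-VL snippet and processor is the converted model's processor:
# sketch: print both prompt strings with repr() so whitespace/newline differences are visible
print(repr(prompt_question))
print(repr(prompt))
print("prompts match:", prompt_question == prompt)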
Ahhh, still not working.
This time I tried to convert llava-onevision-qwen2-0.5b-ov on an 80G A100 machine.
Maybe because I used the same hardware as yours, I passed the test without any modification to expected_slice in convert_llava_onevision_weights_to_hf.py.
The generated_text is (for better readability, I removed the prompt info and added a \n to break the line manually):
The image is a radar chart that compares the performance of different models in a specific task, likely related to natural language processing or machine learning.
The chart is divided into several axes, each representing a different model or method. The models include BLIP-2, InstructBLIP, Qwen-VL-Chat, and LLaVA-1.5. The radar chart shows the performance scores for each model, with the scores ranging from 1 to 80. The models are evaluated
The expected_text is:
The image is a radar chart that compares the performance of different models in a specific task, likely related to natural language processing or machine learning.
The chart is divided into different categories, each represented by a different color and labeled with the name of the model or technique used. The models are evaluated based on their performance metrics, such as BLEU-2, InstructBLIP, Qwen-VL-Chat, and LLaVA-1.5. The radar chart helps to visualize the relative
For contrast, the output text I get from LLaVA-VL's notebook is:
The image is a radar chart that compares the performance of different models in terms of their accuracy and reliability.
The chart shows various benchmarks such as BLIP-2, InstructBLIP, Qwen-VL-Chat, PODE-Bench, MM-Bench, and SEED-Bench, along with other models like VQA (Visual Question Answering), GQA (General Knowledge Question Answering), and SQA-IMG (Squad Image Question Answering). Each model's performance is
In my opinion it's OK for generated_text not to be identical to expected_text, because we may have different CUDA versions or other environment differences. However, generated_text should be the same as LLaVA-VL's output, but in fact it is not. To find the reason, I investigated further.
I saved the output logits in the LLaVA-VL code:
# LLaVA-OneVision Tutorial
with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        images=image_tensor,
        image_sizes=image_sizes,
        output_hidden_states=True,
        output_attentions=False,
    )
logits = outputs.logits
torch.save(outputs, "target.pt")
and loaded them in the conversion script:
# convert_llava_onevision_weights_to_hf.py
# verify single forward pass
print("Single forward pass")
with torch.inference_mode():
    inputs = inputs.to(device)
    outputs = model(**inputs)
    print("Shape of logits:", outputs.logits.shape)
    print("First values of logits:", outputs.logits[0, :3, :3])

target = torch.load("target.pt")
print("Target Logits", target.logits)
print("Output Logits", outputs.logits)

assert torch.allclose(outputs.logits[0, :3, :3], target.logits[0, :3, :3], atol=1e-4)
print("Slice Assertion Passed")
assert torch.allclose(outputs.logits, target.logits, atol=1e-4)
print("Full Assertion Passed")
By running the conversion script, I got:
Single forward pass
Shape of logits: torch.Size([1, 6578, 152000])
First values of logits: tensor([[-12.0234, -14.3828, -12.7500],
[ 2.3594, 1.0000, 3.9336],
[ 3.6582, 4.7148, 9.1172]], device='cuda:0', dtype=torch.float16)
Target Logits tensor([[[-12.0234, -14.3828, -12.7500, ..., 9.0469, 8.8906, 15.0859],
[ 2.3594, 1.0000, 3.9336, ..., 8.4688, 14.5156, -2.7188],
[ 3.6582, 4.7148, 9.1172, ..., 4.8672, 2.9141, -4.6914],
...,
[ 0.3838, -2.3594, -0.6416, ..., 14.5859, 11.2109, 2.6992],
[ -1.1221, -0.2347, -1.3438, ..., 11.0938, 13.5469, 0.5312],
[ 4.8867, 14.5312, 10.2812, ..., 6.6719, 9.5391, -0.4631]]],
device='cuda:0', dtype=torch.float16)
Output Logits tensor([[[-12.0234, -14.3828, -12.7500, ..., 9.3047, 9.2812, 9.2969],
[ 2.3594, 1.0000, 3.9336, ..., -1.7090, -1.7109, -1.7070],
[ 3.6582, 4.7148, 9.1172, ..., -2.5684, -2.5723, -2.5801],
...,
[ 0.3813, -2.3613, -0.6250, ..., 1.5645, 1.5615, 1.5654],
[ -1.1162, -0.2271, -1.3398, ..., 0.4922, 0.4626, 0.4805],
[ 4.8867, 14.5469, 10.2969, ..., 0.4358, 0.4336, 0.4084]]],
device='cuda:0', dtype=torch.float16)
Slice Assertion Passed
Traceback (most recent call last):
File "/root/autodl-tmp/convert.py", line 406, in <module>
convert_llava_to_hf(args.model_id, args.pytorch_dump_folder_path, args.push_to_hub)
File "/root/autodl-tmp/convert.py", line 239, in convert_llava_to_hf
assert torch.allclose(outputs.logits, target.logits, atol=1e-4)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The size of tensor a (152000) must match the size of tensor b (151647) at non-singleton dimension 2
which means that although they have the same [0, :3, :3] slice, the other parts differ.
What kind of reason could lead to this situation? Could there be any problem in the conversion process?
Hmm, afaik the llava implementation defaults to FA2 when loading the model + we need to check the dtypes when loading the model. I think that could be the reason why the logits are slightly different and why the generated text starts diverging as more tokens are generated.
How much is the diff for the logits if we don't take a slice but compare the whole tensor? If it is below the tolerance level, I think we should be good even if the generated texts don't match 100%. The difference in that case is most probably in the way the model is loaded.
afaik the llava implementation defaults to FA2 when loading the model + we need to check the dtypes when loading the model.
Yes, but actually I'm using sdpa as the attention implementation in the LLaVA-OneVision tutorial code.
I'm not familiar with the dtype change process, so I'm afraid I can't provide much help on this.
How much is the diff for the logits if we don't take a slice but compare the whole tensor? If it is below the tolerance level
From the results I just printed, the deviation seems quite significant. We can examine the first row of each tensor.
Target: [-12.0234, -14.3828, -12.7500, ..., 9.0469, 8.8906, 15.0859]
Generated: [-12.0234, -14.3828, -12.7500, ..., 9.3047, 9.2812, 9.2969]
It can be seen that the differences between the last few values are quite large. I'm not sure how these differences will influence the final output.
Besides, the lengths of the logits are also different. Perhaps this is due to additional special tokens introduced during the conversion process.
Maybe we can test the converted model on some benchmarks such as MMMU to evaluate possible accuracy loss.
I'm not familiar with the dtype change process, so I'm afraid I can't provide much help on this.
It can be set when loading the model with XXXModel.from_pretrained(model_id, torch_dtype="float16"). I guess the llava repo uses bf16, but I might be mistaken.
Besides, the lengths of the logits are also different. Perhaps this is due to additional special tokens introduced during the conversion process.
Oh yeah, you're right, it also has an influence. Can you check (logit_llava - logit_converted[:, :-1, :]).abs().max() by cropping the last token from the converted model's logits? Or maybe more than one token if we add more?
Maybe we can test the converted model on some benchmarks such as MMMU to evaluate possible accuracy loss.
IMO this is too much for simply converting the weights. What we usually do is match the logits and the generated results. I'd love to help you out with debugging, but I might be slow this week. If you have a working branch, feel free to open a PR so we can run and test it together.
In general, I am okay as long as the logits match; the generation should match in that case as well, unless the inputs are formatted differently :)
It can be set when loading the model with XXXModel.from_pretrained(model_id, torch_dtype="float16"). I guess the llava repo uses bf16, but I might be mistaken.
I checked the load_pretrained_model() method used in LLaVA-VL. Unfortunately, the default torch_dtype is indeed float16, which is the same as in the conversion script:
# LLaVA-VL
def load_pretrained_model(
    model_path: Any,
    model_base: Any,
    model_name: Any,
    load_8bit: bool = False,
    load_4bit: bool = False,
    device_map: str = "auto",
    torch_dtype: str = "float16",
    attn_implementation: str = "flash_attention_2",
    customized_config: Any | None = None,
    overwrite_config: Any | None = None,
    **kwargs: Any,
) -> Any
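For completeness, this is how I would pin both settings explicitly on the LLaVA-VL side, using the keyword arguments from the signature above (just a sketch; pretrained and model_name are the same variables as in my earlier snippet):
# sketch: force the same dtype and attention backend as in the conversion setup
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,
    None,
    model_name,
    device_map="auto",
    torch_dtype="float16",
    attn_implementation="sdpa",
    multimodal=True,
)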
Oh yeah, you're right, it also has an influence. Can you check (logit_llava - logit_converted[:, :-1, :]).abs().max() by cropping the last token from the converted model's logits? Or maybe more than one token if we add more?
I tried, but the problem seems serious. The lengths of the 3rd dimension are totally different.
ipdb> logit_llava .shape
torch.Size([1, 6578, 151647])
ipdb> logit_converted.shape
torch.Size([1, 6578, 152000])
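To still get a rough number, I can compare only the overlapping part of the vocab dimension (a sketch; it assumes the extra entries in the converted model simply come from resizing/padding the embeddings during conversion):
# sketch: compare only the vocab entries both tensors share
common_vocab = min(logit_llava.shape[-1], logit_converted.shape[-1])
diff = (logit_llava[..., :common_vocab].float() - logit_converted[..., :common_vocab].float()).abs().max()
print(diff)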
IMO this is too much for simply converting the weights. What we usually do is match the logits and the generated results.
In general, I am okay as long as the logits match; the generation should match in that case as well, unless the inputs are formatted differently :)
I agree. We only need to match the logits.
I'd love to help you out with debugging, but I might be slow this week. If you have a working branch, feel free to open a PR so we can run and test it together.
Thanks for the help! However, I need to spare some effort for other tasks as well.
I'll try to investigate the conversion process more when I have time. Perhaps there are some minor mistakes causing the logits not to match.
I tried, but the problem seems serious. The lengths of the 3rd dimension are totally different.
Weird, this should not happen because we don't resize the dimensionality of the lm-head. What can be different is the second dim, the token length, depending on how we tokenized/formatted the inputs. Yes, in that case I agree the conversion is doing something wrong.
No worries, take your time and lmk if you need further assistance
Hi @FuryMartin, I am also trying to convert llava-critic to the hf version. Could you share the modifications needed to implement this? BTW, I need to convert many local llava-ov models into the hf version. I am wondering if there are any big differences.
Hi, the conversion is easy. The core code to convert a llava-ov model is: https://github.com/huggingface/transformers/blob/86701f2b6ff2085a3cd3ad1d30bc2ff2b10fbd94/src/transformers/models/llava_onevision/convert_llava_onevision_weights_to_hf.py#L98-L196
You only need to add an extra line to the block that selects text_model_id, for example adding lmms-lab/llava-critic-7b to the following:
elif model_id in [
    "lmms-lab/llava-onevision-qwen2-7b-ov",
    "lmms-lab/llava-onevision-qwen2-7b-si",
    "lmms-lab/llava-onevision-qwen2-7b-ov-chat",
    "lmms-lab/llava-critic-7b",
]:
    text_model_id = "Qwen/Qwen2-7B-Instruct"
By running the function, you will get a converted model under the output/ folder:
convert_llava_to_hf("lmms-lab/llava-critic-7b", "output/")
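After that, the converted folder should load with the stock transformers classes like any other llava-onevision checkpoint (a quick sketch; I have only checked the forward pass, not full parity):
# sketch: load the converted checkpoint with the regular HF llava-onevision classes
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

processor = AutoProcessor.from_pretrained("output/")
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "output/", torch_dtype=torch.float16, device_map="auto"
)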
However, as we discussed above, we think that the current conversion has some problems, which may result in an accuracy drop compared to the original model. You can use the converted model at your own risk.
System Info
transformers version: 4.46.0

Who can help?
@zucchini-nlp

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I copied convert_llava_onevision_weights_to_hf.py as convert.py, and ran:
Then I encountered an assertion error; it appears that the logits produced by the converted model do not match those specified in the script.

lmms-lab/llava-onevision-qwen2-0.5b-ov output:
lmms-lab/llava-onevision-qwen2-7b-ov output:

Expected behavior
The output logits remain consistent and do not produce an assertion error.