huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Why do the implementation behaviors of official llava and transformers differ? #30415

Closed bleedingfight closed 6 months ago

bleedingfight commented 6 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

My model is the official LLaVA (vicuna + CLIP + mm_projector). Conversion script:

import argparse                                                                                                                                                          
import os                                                                                                                                                                

import torch                                                                                                                                                             
from huggingface_hub import hf_hub_download                                                                                                                              

from transformers import (                                                                                                                                               
    AddedToken,                                                                                                                                                          
    AutoConfig,                                                                                                                                                          
    AutoTokenizer,                                                                                                                                                       
    CLIPImageProcessor,                                                                                                                                                  
    LlavaConfig,                                                                                                                                                         
    LlavaForConditionalGeneration,                                                                                                                                       
    LlavaProcessor,                                                                                                                                                      
)                                                                                                                                                                        

# Rename keys from the original LLaVA checkpoint to the layout expected by
# LlavaForConditionalGeneration: the mm_projector becomes multi_modal_projector
# and the LLaMA weights move under language_model.
KEYS_TO_MODIFY_MAPPING = {
    "model.vision_tower.": "",                                                                                                                                           
    "model.mm_projector": "multi_modal_projector",                                                                                                                       
    "model": "model.model",                                                                                                                                              
    "vision_model.model": "vision_model",                                                                                                                                
    "lm_head": "language_model.lm_head",                                                                                                                                 
    "model.model": "language_model.model",                                                                                                                               
    "multi_modal_projector.0": "multi_modal_projector.linear_1",                                                                                                         
    "multi_modal_projector.2": "multi_modal_projector.linear_2",                                                                                                         
}                                                                                                                                                                        

def convert_state_dict_to_hf(state_dict):                                                                                                                                
    new_state_dict = {}                                                                                                                                                  
    for key, value in state_dict.items():                                                                                                                                
        for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():                                                                                                    
            if key_to_modify in key:                                                                                                                                     
                key = key.replace(key_to_modify, new_key)                                                                                                                

        new_state_dict[key] = value                                                                                                                                      
    return new_state_dict
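
# Example renames produced by the mapping above (illustrative, based on LLaVA-1.5 key names):
#   "model.mm_projector.0.weight"            -> "multi_modal_projector.linear_1.weight"
#   "model.layers.0.self_attn.q_proj.weight" -> "language_model.model.layers.0.self_attn.q_proj.weight"
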
def convert_llava_llama_to_hf(                                                                                                                                           
    text_model_id, vision_model_id, output_hub_path, old_state_dict_id                                                                                                   
):
    torch.set_default_dtype(torch.float16) 
    text_config = AutoConfig.from_pretrained(text_model_id)

    tokenizer = AutoTokenizer.from_pretrained(text_model_id)
    tokenizer.add_tokens(
        AddedToken("<image>", special=True, normalized=False), special_tokens=True
    )
    tokenizer.add_special_tokens({"pad_token": "<pad>"})

    image_processor = CLIPImageProcessor.from_pretrained(vision_model_id)

    processor = LlavaProcessor(tokenizer=tokenizer, image_processor=image_processor) 

    # vision_config = CLIPVisionConfig.from_pretrained(vision_model_id)

    config = LlavaConfig(text_config=text_config)
    # config.pad_token_id = 32001

    model = LlavaForConditionalGeneration(config)

    # Pad to 64 for performance reasons
    pad_shape = 64

    state_dict_path = os.path.join(old_state_dict_id, "model_state_dict.bin")
    if not os.path.exists(state_dict_path):
        state_dict_path = hf_hub_download(old_state_dict_id, "model_state_dict.bin") 

    state_dict = torch.load(state_dict_path, map_location="cpu")
    state_dict = convert_state_dict_to_hf(state_dict)
    # Replace the LLM model's weights with the converted state dict
    model.load_state_dict(state_dict, strict=True, assign=True)

    # Fit a Gaussian to the existing token embeddings; the rows for the newly
    # added <image> and <pad> tokens are sampled from it below.
    pre_expansion_embeddings = model.language_model.model.embed_tokens.weight.data
    mu = torch.mean(pre_expansion_embeddings, dim=0).float()
    n = pre_expansion_embeddings.size()[0] 
    sigma = ((pre_expansion_embeddings - mu).T @ (pre_expansion_embeddings - mu)) / n
    dist = torch.distributions.multivariate_normal.MultivariateNormal(
        mu, covariance_matrix=1e-5 * sigma 
    )

    # We added the <image> and <pad> tokens above, so resize the embeddings
    # (vocab_size + 2, padded to a multiple of 64)
    model.resize_token_embeddings(config.text_config.vocab_size + 2, pad_shape)
    model.language_model.model.embed_tokens.weight.data[32000:] = torch.stack(
        tuple(
            (
                dist.sample()
                for _ in range(
                    model.language_model.model.embed_tokens.weight.data[32000:].shape[0]
                )
            )
        ),
        dim=0,
    )
    model.language_model.lm_head.weight.data[32000:] = torch.stack(
        tuple(
            (
                dist.sample()
                for _ in range(
                    model.language_model.lm_head.weight.data[32000:].shape[0]
                )
            )
        ),
        dim=0,
    )
    # Save locally by default; set USE_LOCAL=False in the environment to push to the Hub
    is_local = os.environ.get("USE_LOCAL", "True").strip().lower() not in ("false", "0", "no")
    if not is_local:
        model.push_to_hub(output_hub_path) 
        processor.push_to_hub(output_hub_path)
    else:
        model.save_pretrained(output_hub_path)
        processor.save_pretrained(output_hub_path)
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--text_model_id",
        help="Hub location of the text model",
    )
    parser.add_argument(
        "--vision_model_id",
        help="Hub location of the vision model",
    )
    parser.add_argument(
        "--output_hub_path",
        help="Location on the hub of the converted model",
    )
    parser.add_argument(
        "--old_state_dict_id",
        help="Location on the hub of the raw state dict of the original model. The filename needs to be `model_state_dict.bin`",
    )
    args = parser.parse_args()
    convert_llava_llama_to_hf(
        args.text_model_id,
        args.vision_model_id,
        args.output_hub_path,
        args.old_state_dict_id,
    )

if __name__ == "__main__":
    main()

I only modified the script slightly so that I can save the model locally instead of pushing it to the HF Hub.

The inference code may look a bit long, but it is simple: it just loads the model and calls generate to produce the output. USETRANSFORMER switches between the two inference backends (transformers vs. the official llava code); a minimal sketch of the transformers path is shown after the output below. The official llava output is:

B
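
For reference, the transformers-side inference is roughly the following (the model path, image, prompt, and generation settings are placeholders, not the exact values from my run):

import torch
from PIL import Image
from transformers import LlavaForConditionalGeneration, LlavaProcessor

model_path = "./llava-converted"  # output_hub_path from the conversion script above (placeholder)
model = LlavaForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")
processor = LlavaProcessor.from_pretrained(model_path)

prompt = "A chat between a curious user and an artificial intelligence assistant. ... USER: <image>\nQUESTION ASSISTANT:"
image = Image.open("example.jpg")

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.batch_decode(output, skip_special_tokens=True)[0])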


I tried debugging the code, and the two generation pipelines look like this:

transformers:

prompt --> tokens ---------\
                            +--> greedy_search (language_model) --> one_token, two_token, ...
image  --> image embedding /                 ^                              |
                                             '--- each new token fed back --'
output = [prompt tokens, one_token, two_token, ...] --> decode

llava official:

prompt --> tokens ---------\
                            +--> greedy_search --> one_token, two_token, three_token, ...
image  --> image embedding /            ^                        |
                                        '-- each new token fed back --'
output = [one_token, two_token, ...] --> decode   (prompt tokens are dropped before decoding)
This is transformers:
![Screenshot_20240423_174550](https://github.com/huggingface/transformers/assets/11495161/50ab6265-e8c7-4d04-9e16-baf291bb7237)
llava official:
![Screenshot_20240423_175715](https://github.com/huggingface/transformers/assets/11495161/fa0ade4d-15dc-4398-b06d-bde62689e07a)

In both cases input_ids is what goes into the model, but in transformers the returned sequence still starts with the prompt's input_ids, so the decoded output includes the prompt.
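
In other words, the difference is only whether the prompt tokens are sliced off before decoding. Reusing the placeholder names from the sketch above, something like:

# transformers: generate() returns the prompt ids followed by the new tokens,
# so decoding the whole sequence reproduces the prompt as well.
full_output = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(full_output, skip_special_tokens=True)[0])  # "... ASSISTANT: B"

# llava official: only the tokens generated after the prompt are decoded.
new_tokens = full_output[:, inputs["input_ids"].shape[-1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])   # "B"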

### Expected behavior

transformers:"chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:  \n下面的
文章描述了一个实验。阅读文章,然后按照以下说明进行操作。\n\nMadelyn在雪板的底部涂上了一层薄蜡,然后直接下坡滑行。然后,她去掉了蜡,再次直接下坡滑行。她重复了这个过程四次,每次都交替使用薄蜡或不使用薄蜡滑行。她的朋友Tucker计时每次滑行的时间。Madelyn和Tucker计算了使用薄蜡滑行和不使用薄蜡滑行时直接下坡所需的平均时间。\n图:滑雪板下坡。\n麦德琳和塔克的实验能最好回答哪个问题?\nA. 当麦德琳的雪板上有一层薄蜡或一层厚蜡时,它是否能在较短的时间内滑下山坡?\nB. 当麦德琳的雪板上有一层蜡或没有蜡时,它是否能在较短
的时间内滑下山坡?\n请直接回答选项字母。 ASSISTANT: B"
llava:B
amyeroberts commented 6 months ago

cc @younesbelkada

zucchini-nlp commented 6 months ago

@bleedingfight hey!

Yes, transformers currently returns prompt + generated text as the output of generate for generative models. If you want only the generated part, you can:

out = model.generate(**inputs)
out_wo_prompt = out[:, inputs.input_ids.shape[-1]:]
print(tokenizer.batch_decode(out_wo_prompt, skip_special_tokens=True))
bleedingfight commented 6 months ago

@zucchini-nlp ok, thanks