haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

🐛 [BUG] llava-v1.6-mistral-7b fails to generate the right response via 'mistral_instruct' template #1363

Open · clownrat6 opened 7 months ago

clownrat6 commented 7 months ago

Description

I wrote an inference script like this:

import torch
from PIL import Image

import sys
sys.path.append('./')
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria

def main():
    disable_torch_init()  # skip redundant default parameter init to speed up model loading
    image = 'llava/serve/examples/extreme_ironing.jpg'
    inp = 'What is unusual about this image?'
    model_path = 'liuhaotian/llava-v1.6-mistral-7b'
    conv_mode = sys.argv[1]

    model_name = get_model_name_from_path(model_path)
    # model_base=None because this is a full checkpoint, not LoRA weights
    tokenizer, model, processor, _ = load_pretrained_model(model_path, None, model_name)
    conv = conv_templates[conv_mode].copy()
    roles = conv.roles

    # preprocess the image with the model's image processor
    image = Image.open(image)
    image_tensor = processor.preprocess(image, return_tensors='pt')['pixel_values']
    if type(image_tensor) is list:
        tensor = [img.to(model.device, dtype=torch.float16) for img in image_tensor]
    else:
        tensor = image_tensor.to(model.device, dtype=torch.float16)

    print(f"{roles[0]}: {inp}")
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    # tokenize the prompt, replacing the image placeholder with IMAGE_TOKEN_INDEX
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
    # the stop string depends on the template's separator style
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=tensor,
            do_sample=True,
            temperature=0.2,
            max_new_tokens=1024,
            use_cache=True,
            stopping_criteria=[stopping_criteria])

    # decode only the newly generated tokens, skipping the prompt
    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).strip()
    print(outputs)

if __name__ == '__main__':
    main()

If I run the command python inference.py mistral_instruct, the script generates empty output. If I run python inference.py llava_v1, it generates normal output:

city setting with traffic. It is also not typical to see someone standing on the back of a vehicle, as it can be dangerous and is generally not allowed. The man's actions are likely intended to be humorous or to draw attention to a specific cause or event. </s>
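
A minimal diagnostic sketch (using only the conv_templates API already imported above; the probe question is just an example) to compare what each mode actually feeds the model, and what stop string the script derives from it:

from llava.constants import DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates, SeparatorStyle

for mode in ('mistral_instruct', 'llava_v1'):
    conv = conv_templates[mode].copy()
    conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + '\nWhat is unusual about this image?')
    conv.append_message(conv.roles[1], None)
    # same stop-string logic as the inference script above
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    print(f'--- {mode} ---')
    print('prompt:', repr(conv.get_prompt()))   # exact string handed to the tokenizer
    print('stop string:', repr(stop_str))       # keyword KeywordsStoppingCriteria watches for

If mistral_instruct reports an empty stop string (its sep_style is LLAMA_2 and, as far as I can tell, its sep is the empty string), note that an empty keyword is contained in any decoded text, so the stopping criteria would fire on the first generated token, which would explain the empty output.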
paralym commented 7 months ago

I tried to finetune llava-v1.6-mistral-7b with the mistral_instruct template, but the output was not in the expected format. Have you figured out which template llava-v1.6-mistral-7b uses?
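
In case it helps, the registered template names can be listed directly; a quick sketch, assuming the repo is on sys.path as in the script above:

from llava.conversation import conv_templates

# conv_templates is a plain dict mapping template names to Conversation objects
print(sorted(conv_templates.keys()))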

cooleel commented 6 months ago

> I tried to finetune llava-v1.6-mistral-7b with the mistral_instruct template, but the output was not in the expected format. Have you figured out which template llava-v1.6-mistral-7b uses?

Did you solve it? And which version did you use for pretraining and finetuning, by the way?

paralym commented 6 months ago

> > I tried to finetune llava-v1.6-mistral-7b with the mistral_instruct template, but the output was not in the expected format. Have you figured out which template llava-v1.6-mistral-7b uses?
>
> Did you solve it? And which version did you use for pretraining and finetuning, by the way?

The codebase does not support LLaVA-1.6 training and I haven't solved it yet, but I'm going to work on this in the coming days. I used the latest code to finetune llava-v1.6-mistral-7b.

chanangad commented 5 months ago

I think the llava-v1.6-mistral-7b model uses the llava_llama_2 conversation template. You can try it out!
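
Since the repro script above takes the template name as its first argument (and llava_llama_2 is one of the names registered in llava/conversation.py), this suggestion can be tried directly:

python inference.py llava_llama_2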