haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Getting output likelihood scores from the model #108

Open vishaal27 opened 1 year ago

vishaal27 commented 1 year ago

Question

Hi, is it possible to get the tokenwise log-likelihood scores of different outputs from the model?

The use-case would be something like: given an interleaved image/text input and a list of output text candidates, we should be able to get a score for each candidate and return their ranked list, rather than generating the outputs directly. This would be close to how LLMs are evaluated on MCQ tasks. An example from the T0 paper, page 6 (https://arxiv.org/pdf/2110.08207.pdf):

For tasks that involve choosing the correct completion from several options (e.g. multiple choice
question answering), we follow Brown et al. (2020) and use rank classification to evaluate our
model: we compute the log-likelihood of each of the target options under the fine-tuned model and
select the option with the highest log-likelihood as the prediction. For simplicity, we do not apply
length normalization to the log-likelihoods of the target options.

Is it straightforward to do this with LLaVA? I assume that, since the LM is built on transformers, the output-score functionality already implemented there could be used (I haven't dug into this yet)?
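Concretely, the kind of scoring I have in mind looks roughly like this with a plain text-only causal LM (just an illustration with GPT-2 as a stand-in, not LLaVA code; the prompt and class names are placeholders):

# Rough sketch of rank classification with a text-only causal LM: score each
# candidate continuation by the summed log-likelihood of its tokens, then argmax.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "This is a photo of a"
candidates = ["dog", "cat", "airplane"]

scores = []
for cand in candidates:
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + cand, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits              # [1, seq_len, vocab_size]
    log_probs = torch.log_softmax(logits, dim=-1)
    # log-prob of each candidate token given everything before it
    total = sum(log_probs[0, j - 1, full_ids[0, j]].item()
                for j in range(prompt_len, full_ids.shape[1]))
    scores.append(total)

print("prediction:", candidates[scores.index(max(scores))])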

haotian-liu commented 1 year ago

Hi @vishaal27, thank you for the great question. Yes it is easy to do this with LLaVA.

Here is a simple example that you may start with, by inserting this into run_llava.py:

generation_output = model.generate(
    input_ids,
    images=image_tensor.unsqueeze(0).half().cuda(),
    do_sample=True,
    temperature=0.2,
    max_new_tokens=1024,
    stopping_criteria=[stopping_criteria],
    # add following two lines
    return_dict_in_generate=True,
    output_scores=True
)

input_token_len = input_ids.shape[1]
output_ids = generation_output.sequences[0, input_token_len:]
output_scores = generation_output.scores
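If you want the log-probability of each sampled token (rather than the raw score tensors), one way is to log-softmax each step's scores and index the sampled id. A rough sketch on top of the snippet above (it assumes the output_ids and output_scores variables computed there; note that with do_sample=True the returned scores are post-processing, so most vocabulary entries show up as -inf):

# Sketch: per-token log-probabilities of the generated continuation.
token_log_probs = []
for tok, step_scores in zip(output_ids, output_scores):
    step_log_probs = torch.log_softmax(step_scores[0].float(), dim=-1)  # batch element 0
    token_log_probs.append(step_log_probs[tok].item())

sequence_log_prob = sum(token_log_probs)  # unnormalized log-likelihood of the generation

If your transformers version has it, model.compute_transition_scores(generation_output.sequences, generation_output.scores, normalize_logits=True) should give essentially the same per-token scores.
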
vishaal27 commented 1 year ago

Thanks @haotian-liu, but as I understand it, this will return the log-likelihood of the generated output given some initial prompt, right? I don't want to generate more tokens, but rather evaluate the likelihood of a given token sequence under the model. For example, to do ImageNet classification with this model, I would evaluate the log-likelihood of the sequence <image> This is a photo of a {CLASS}, iterating over all classnames and replacing {CLASS} appropriately, and then take the argmax of the log-likelihoods over classes. I guess the code you provided would generate more tokens on top of the <image> This is a photo of a {CLASS} sequence, and then return the log-likelihood of the entire sequence, right? Please correct me if I misunderstood something, thanks!

haotian-liu commented 1 year ago

@vishaal27 I think this is also possible. Consider the following (pseudo) code:

message = """Human: <image> what is the object in the photo?
GPT: This is a photo of a """
input_ids = tokenizer(message)

The first output token should be the CLASS if it is a single-token word, and you can obtain the log likelihood with the code above. Please correct me if I misunderstand anything, thanks.
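One thing worth checking first is whether the class word really is a single token; the LLaMA tokenizer typically prepends a BOS token, so a single-token word comes back as two ids. A quick sketch (assuming tokenizer is the tokenizer loaded in run_llava.py):

# Sketch: check which class names are single tokens under the tokenizer.
for cls in ["dog", "cat", "elephant"]:
    ids = tokenizer(cls).input_ids  # e.g. [bos_id, class_token_id] for a single-token word
    print(cls, ids, "single-token" if len(ids) == 2 else "multi-token")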

copperwiring commented 3 months ago

@vishaal27 Did you find a solution to your problem (I know it's an old issue)? I have a similar issue: I have a set of possible options and I want to compute the log prob of each of those options as the output. When using a prompt-based method, tokens are generated, and in the case of a single-word output I still only get the prob of that one output, not the distribution of probs over all my possible options (which in your case were classes, I think). How did you resolve it?

vishaal27 commented 3 months ago

This code should work:

from llava.constants import (
    IMAGE_TOKEN_INDEX,
    DEFAULT_IMAGE_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_IM_END_TOKEN,
    IMAGE_PLACEHOLDER,
)
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import (
    process_images,
    tokenizer_image_token,
    get_model_name_from_path,
    KeywordsStoppingCriteria,
)
from PIL import Image
import requests
from io import BytesIO
import re
import torch
import numpy as np

def image_parser(image_file):
    out = image_file.split(',')
    return out

def load_image(image_file):
    if image_file.startswith("http") or image_file.startswith("https"):
        response = requests.get(image_file)
        image = Image.open(BytesIO(response.content)).convert("RGB")
    else:
        image = Image.open(image_file).convert("RGB")
    return image

def load_images(image_files):
    out = []
    for image_file in image_files:
        image = load_image(image_file)
        out.append(image)
    return out

def count_all_parameters(model):
    return sum(p.numel() for p in model.parameters())

def eval_model(model_path, image_file, query, options):
    # Model
    disable_torch_init()

    model_name = get_model_name_from_path(model_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path, None, model_name
    )

    qs = query
    image_token_se = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
    if IMAGE_PLACEHOLDER in qs:
        if model.config.mm_use_im_start_end:
            qs = re.sub(IMAGE_PLACEHOLDER, image_token_se, qs)
        else:
            qs = re.sub(IMAGE_PLACEHOLDER, DEFAULT_IMAGE_TOKEN, qs)
    else:
        if model.config.mm_use_im_start_end:
            qs = image_token_se + "\n" + qs
        else:
            qs = DEFAULT_IMAGE_TOKEN + "\n" + qs

    if "llama-2" in model_name.lower():
        conv_mode = "llava_llama_2"
    elif "v1" in model_name.lower():
        conv_mode = "llava_v1"
    elif "mpt" in model_name.lower():
        conv_mode = "mpt"
    else:
        conv_mode = "llava_v0"

    conv = conv_templates[conv_mode].copy()
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    image_files = image_parser(image_file)
    images = load_images(image_files)
    images_tensor = process_images(
        images,
        image_processor,
        model.config
    ).to(model.device, dtype=torch.float16)

    log_lik_scores = []

    for option in options:

        target_prompt = prompt + ' ' + option
        print(target_prompt)

        input_ids = (
            tokenizer_image_token(target_prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
            .unsqueeze(0)
            .cuda()
        )
        attention_mask = torch.ones_like(input_ids)

        with torch.inference_mode(), torch.cuda.amp.autocast():
            outputs = model.forward(
                input_ids=input_ids,
                labels=input_ids,
                attention_mask=attention_mask,
                images=images_tensor,
                )

        log_lik_scores.append((option, -outputs.loss.item()))

    pred_id = np.argmax(np.asarray([x[1] for x in log_lik_scores]))
    print(log_lik_scores)
    print('Prediction: {}'.format(log_lik_scores[pred_id]))

if __name__ == '__main__':    

    model_path = "liuhaotian/llava-v1.5-13b"

    prompt = "Describe the image."
    image_file = "https://llava-vl.github.io/static/images/view.jpg"

    shared_prompt = 'This is an image of a '
    options = [shared_prompt+x for x in ['horse', 'lion', 'tiger', 'elephant', 'eagle', 'dog']]

    eval_model(model_path, image_file, prompt, options)
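If you also want a probability distribution over the options (like the outputs shown further down this thread), one option is to softmax the log-likelihood scores. A rough sketch of lines that could go at the end of eval_model, after pred_id is computed (reusing log_lik_scores and np from above):

# Sketch: softmax over the per-option log-likelihood scores to get a
# distribution over the options.
option_scores = np.asarray([s for _, s in log_lik_scores])
option_probs = np.exp(option_scores - option_scores.max())
option_probs /= option_probs.sum()
for (option, _), p in zip(log_lik_scores, option_probs):
    print('{}: {:.4f}'.format(option, p))
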
copperwiring commented 3 months ago

Thanks, @vishaal27 It was helpful!

copperwiring commented 3 months ago

@vishaal27 Though the answers are correct, I am surprised that the probabilities of all the options are so close to each other. I computed the log-likelihoods and probs.

The prompt was slightly different, but with the same image and these options: ['cat', 'river', 'dog', 'Invalid option'], I got the following outputs:

Log likelihood scores:
Assistant: If had to select one of the options, my answer would be cat: -3.405327081680298
Assistant: If had to select one of the options, my answer would be river: -3.405212163925171
Assistant: If had to select one of the options, my answer would be dog: -3.413139581680298
Assistant: If had to select one of the options, my answer would be Invalid option: -3.4227676391601562
**************************************************
Probabilities:
Assistant: If had to select one of the options, my answer would be cat: 0.2515695733004952
Assistant: If had to select one of the options, my answer would be river: 0.251598484772306
Assistant: If had to select one of the options, my answer would be dog: 0.2496118433492265
Assistant: If had to select one of the options, my answer would be Invalid option: 0.24722009857797245
Prediction: Assistant: If had to select one of the options, my answer would be river with probability 0.251598484772306

Did you get similar scores too?

vishaal27 commented 3 months ago

That could potentially be because your prompts are too long? One option would be to length-normalize your log-likelihood scores by the number of tokens in the prompt. In my experiments this did not make too much of a difference, but if you expect your prompts to be very long or of significantly different token lengths, I would recommend using length-normalized log-likelihoods. For reference, see: https://blog.eleuther.ai/multiple-choice-normalization/
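For completeness, the normalization variants discussed in that post roughly boil down to the following (a generic sketch; total_logprob is the summed log-likelihood of a candidate's answer tokens, and the token/byte counts refer to the answer continuation only):

# Sketch: common normalization schemes for multiple-choice scoring.
def candidate_scores(total_logprob, num_answer_tokens, answer_text):
    return {
        "unnormalized": total_logprob,                                        # raw sum of log-probs
        "token_normalized": total_logprob / num_answer_tokens,                # per-token average
        "byte_normalized": total_logprob / len(answer_text.encode("utf-8")),  # per-byte average
    }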

copperwiring commented 3 months ago

Not really. Even if I change the prompt to

    prompt = """Describe the image. \n\n"""
    shared_prompt = 'This is an image of a '
    options = [shared_prompt+x for x in  ['cat', 'river', 'dog', 'Invalid option']]

    eval_model(model_path, image_file, prompt, options)

the outputs are still similar (very close to uniform):


Log likelihood scores:
This is an image of a cat: -4.218678951263428
This is an image of a river: -4.173059940338135
This is an image of a dog: -4.236606121063232
This is an image of a Invalid option: -4.627951145172119
**************************************************
Probabilities:
This is an image of a cat: 0.2707795140599804
This is an image of a river: 0.2834183003354851
This is an image of a dog: 0.2659684569012282
This is an image of a Invalid option: 0.17983372870330633
Prediction: This is an image of a river with probability 0.2834183003354851
vishaal27 commented 3 months ago

Yes, however these look quite similar to the scores I was getting. One correction to my earlier comment: the scores are actually length-normalised since internally it uses nn.CrossEntropyLoss which by default has reduction='mean' set.

You could try checking the length-unnormalised scores by:

        log_lik_scores.append((option, -outputs.loss.item() * input_ids.shape[1]))

instead of

        log_lik_scores.append((option, -outputs.loss.item()))

However, I wouldn't expect to see too much of a difference; in general, the scores you are getting look similar to what I saw in my own experiments.

SakuraTroyChen commented 2 months ago

Maybe it is better to use the output_scores to calculate the softmax scores?

def eval_relevance(args, tokenizer, model, image_processor):
    disable_torch_init()

    model_name = get_model_name_from_path(args.model_path)

    if "llama-2" in model_name.lower():
        conv_mode = "llava_llama_2"
    elif "mistral" in model_name.lower():
        conv_mode = "mistral_instruct"
    elif "v1.6-34b" in model_name.lower():
        conv_mode = "chatml_direct"
    elif "v1" in model_name.lower():
        conv_mode = "llava_v1"
    elif "mpt" in model_name.lower():
        conv_mode = "mpt"
    else:
        conv_mode = "llava_v0"

    if args.conv_mode is not None and conv_mode != args.conv_mode:
        print(
            "[WARNING] the auto inferred conversation mode is {}, while `--conv-mode` is {}, using {}".format(
                conv_mode, args.conv_mode, args.conv_mode
            )
        )
    else:
        args.conv_mode = conv_mode

    qs = args.query
    if args.image_file != "":
        image_token_se = (
            DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
        )
        if IMAGE_PLACEHOLDER in qs:
            if model.config.mm_use_im_start_end:
                qs = re.sub(IMAGE_PLACEHOLDER, image_token_se, qs)
            else:
                qs = re.sub(IMAGE_PLACEHOLDER, DEFAULT_IMAGE_TOKEN, qs)
        else:
            if model.config.mm_use_im_start_end:
                qs = image_token_se + "\n" + qs
            else:
                qs = DEFAULT_IMAGE_TOKEN + "\n" + qs

        image_files = image_parser(args)
        images = load_images(image_files)
        image_sizes = [x.size for x in images]
        images_tensor = process_images(images, image_processor, model.config)
        if type(images_tensor) is list:
            for i in range(len(images_tensor)):
                images_tensor[i] = images_tensor[i].to(
                    model.device, dtype=torch.float16
                )
        else:
            images_tensor = images_tensor.to(model.device, dtype=torch.float16)
    else:
        images_tensor = None
        image_sizes = None

    conv = conv_templates[args.conv_mode].copy()
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = (
        tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
        .unsqueeze(0)
        .cuda()
    )
    with torch.inference_mode():
        generation_output = model.generate(
            input_ids,
            images=images_tensor,
            image_sizes=image_sizes,
            do_sample=True if args.temperature > 0 else False,
            temperature=args.temperature,
            top_p=args.top_p,
            num_beams=args.num_beams,
            max_new_tokens=args.max_new_tokens,
            use_cache=True,
            return_dict_in_generate=True,
            output_scores=True,
        )

    logits = generation_output.scores[0][0]

    probs = (
        torch.nn.functional.softmax(
            torch.tensor(
                [
                    logits[tokenizer("Yes").input_ids[1]],
                    logits[tokenizer("No").input_ids[1]],
                ]
            ),
            dim=0,
        )
        .detach()
        .cpu()
        .numpy()
    )

    return probs[0]

Just replace the tokens "Yes" and "No" with your options.
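For more than two options, the same idea could be extended by gathering the score of each option's first token. A rough sketch on top of the snippet above (it assumes every option starts with a distinct single token, and that tokenizer(...).input_ids[1] is the first real token after the BOS token the LLaMA tokenizer prepends; option_texts is a placeholder):

# Sketch: softmax over the first generated token's scores, restricted to the
# first token id of each option (reuses generation_output and tokenizer from above).
option_texts = ["cat", "river", "dog"]
first_token_scores = generation_output.scores[0][0]             # [vocab_size], batch element 0
option_ids = [tokenizer(o).input_ids[1] for o in option_texts]  # [1] skips the BOS token
option_logits = torch.stack([first_token_scores[i] for i in option_ids])
option_probs = torch.nn.functional.softmax(option_logits.float(), dim=0)
for text, p in zip(option_texts, option_probs.tolist()):
    print(f"{text}: {p:.4f}")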

Stardust-y commented 2 months ago

> Here is a simple example that you may start with, by inserting this into run_llava.py: [same generate snippet as above, with return_dict_in_generate=True and output_scores=True]

I'm using the same script to get the likelihood score. The output text is correct, but the scores contain -inf:

tensor([[ -inf,  -inf,     -inf,  ...,  -inf,  -inf,  -inf],
        [ -inf,  -inf,     -inf,  ...,  -inf,  -inf,  -inf],
        [ -inf,  -inf,     -inf,  ...,  -inf,  -inf,  -inf],
        ...,
        [ -inf,  -inf,  42.3438,  ...,  -inf,  -inf,  -inf],
        [ -inf,  -inf,  48.3594,  ...,  -inf,  -inf,  -inf],
        [ -inf,  -inf,  91.0938,  ...,  -inf,  -inf,  -inf]], device='cuda:0')
tensor([[ -inf,  -inf,     -inf,  ...,  -inf,  -inf,  -inf],
        [ -inf,  -inf,     -inf,  ...,  -inf,  -inf,  -inf],
        [ -inf,  -inf,     -inf,  ...,  -inf,  -inf,  -inf],
        ...,
        [ -inf,  -inf,     -inf,  ...,  -inf,  -inf,  -inf],
        [ -inf,  -inf,     -inf,  ...,  -inf,  -inf,  -inf],
        [ -inf,  -inf,  82.8125,  ...,  -inf,  -inf,  -inf]], device='cuda:0')

KatameRonin commented 1 month ago

> I'm using the same script to get the likelihood score. The output text is correct, but the scores contain -inf: [same tensors as above]

The scores you get have the shape [num_generated_tokens, batch, vocab_size] (a tuple of score tensors, one per generated token, each of shape [batch, vocab_size]). The -inf entries are the scores assigned to the rest of the vocabulary at each token position (with do_sample=True these are typically the tokens filtered out by the top-k/top-p processors), and once you take a softmax over them these -inf scores simply become 0.
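A quick self-contained check of that last point, showing that the -inf entries contribute zero probability after the softmax:

import torch

scores = torch.tensor([float("-inf"), float("-inf"), 42.3438, float("-inf")])
print(torch.softmax(scores, dim=0))  # tensor([0., 0., 1., 0.])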