Closed vishaal27 closed 1 year ago
Yes, this should be easy to compute. All you would need to do is pass in the appropriate input (i.e., interleaved image + text + the target option) as the labels argument in the forward pass. Since we use the HuggingFace endpoint (https://github.com/kohjingyu/fromage/blob/main/fromage/models.py#L264), I think that output.loss will already give you the negative of the log-likelihood (so the option with the lowest loss will be the answer).
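The selection rule above (lowest loss wins) can be sketched as follows. This is a minimal illustration that assumes one loss value per candidate option has already been obtained from the forward pass; the helper name rank_options_by_loss and the loss values are hypothetical and not part of the FROMAGe codebase.

```python
from typing import Dict, List, Tuple


def rank_options_by_loss(losses: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort candidate options from most to least likely (lowest loss first)."""
    return sorted(losses.items(), key=lambda item: item[1])


# Hypothetical per-option losses, made up purely for illustration.
example_losses = {"a dog": 3.05, "a cat": 2.31, "a tick": 4.87}

ranking = rank_options_by_loss(example_losses)
best_option = ranking[0][0]  # the option with the lowest loss
```

Returning the full ranked list rather than only the best option keeps the scores available for later inspection.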
Alternatively, you can add a line to generate_for_images_and_texts() to return the loss rather than the embeddings. This might be easier if you have some non-standard input, since you can pass in any arbitrarily interleaved list of PIL.Images and str objects. You would probably just need to specify num_words=0 in the arguments, and edit this line to return outputs.loss rather than the generated text + images:
Hope that makes sense!
Great, thanks -- this is super helpful!
When I try the second option you suggested, outputs.loss is None. I specify num_words=0 and return outputs.loss, but this doesn't seem to work. My input to generate_for_images_and_texts() is the entire interleaved list of PIL images and strs. Could you let me know what the issue might be? I think the problem is that the labels argument to the lm is not set in this call, but I am not able to figure out what exactly I should set it to.
Ah I see, you're right. I think you will need to pass in a labels tensor which contains the tokenizer() outputs for the text, and -100 for image embeddings. Something like this:
So for example, a sequence of <image>a lazy cat<image>a happy dog would be encoded as something like [-100, 2, 102, 22414, 4758, -100, 102, 1372, 2335] (from the OPT tokenizer) if each image is encoded as a single vector, which is the case for the base model (there is another vis4 model that embeds each image as 4 vectors, in which case each image would be [-100, -100, -100, -100] instead). And you can compute the loss with that sequence.
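The label layout described above can be sketched in plain Python. This is only an illustration: build_labels and the IMAGE marker are hypothetical names, and the token ids are the ones quoted in the example above rather than freshly produced by a tokenizer.

```python
from typing import List, Union

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips
IMAGE = "<image>"    # hypothetical marker for an image position


def build_labels(segments: List[Union[str, List[int]]],
                 n_visual_tokens: int = 1) -> List[int]:
    """Flatten interleaved segments into one label sequence.

    Each image contributes n_visual_tokens ignored positions; each text
    segment contributes its token ids unchanged.
    """
    labels: List[int] = []
    for segment in segments:
        if isinstance(segment, str) and segment == IMAGE:
            labels.extend([IGNORE_INDEX] * n_visual_tokens)
        else:
            labels.extend(segment)
    return labels


# Token ids copied from the example above ("a lazy cat", "a happy dog").
segments = [IMAGE, [2, 102, 22414, 4758], IMAGE, [102, 1372, 2335]]

base_labels = build_labels(segments)                     # one vector per image
vis4_labels = build_labels(segments, n_visual_tokens=4)  # four vectors per image
```

In the real model the labels would be a tensor with a batch dimension; the flat list here only shows where the -100 entries go.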
I think that should work, but please let me know if it doesn't.
Thanks @kohjingyu, I was able to get the output log-likelihood scores using your suggested method. This is my method inside FromageModel; could you please check if this looks about right? [Note: This is for the base model, so one token per image.]
def get_log_lik_scores(
    self, prompts: List):
    """
    Output the log likelihoods of the given interleaved prompts.

    Args:
        prompts: List of interleaved PIL.Image.Image and strings representing input to the model.
    Returns:
        log lik score of prompt sequence.
    """
    input_embs = []
    input_ids = []
    add_bos = True

    for i, p in enumerate(prompts):
        if type(p) == Image.Image:
            # Encode as image.
            pixel_values = utils.get_pixel_values_for_model(self.model.feature_extractor, p)
            pixel_values = pixel_values.to(device=self.model.logit_scale.device, dtype=self.model.logit_scale.dtype)
            pixel_values = pixel_values[None, ...]

            visual_embs = self.model.get_visual_embs(pixel_values, mode='captioning')  # (1, n_visual_tokens, D)
            input_embs.append(visual_embs)
            id_ = torch.tensor([-100], dtype=torch.int64).to(self.model.logit_scale.device).unsqueeze(0)
            input_ids.append(id_)
        elif type(p) == str:
            text_ids = self.model.tokenizer(p, add_special_tokens=True, return_tensors="pt").input_ids.to(self.model.logit_scale.device)
            if not add_bos:
                # Remove <bos> tag.
                text_ids = text_ids[:, 1:]
            else:
                # Only add <bos> once.
                add_bos = False

            text_embs = self.model.input_embeddings(text_ids)  # (1, T, D)
            input_embs.append(text_embs)
            input_ids.append(text_ids)
        else:
            raise ValueError(f'Input prompts should be either PIL.Image.Image or str types, got {type(p)} instead.')

    input_embs = torch.cat(input_embs, dim=1)
    input_ids = torch.cat(input_ids, dim=1)

    outputs = self.model.lm(inputs_embeds=input_embs, labels=input_ids, use_cache=False, output_hidden_states=True)
    return -outputs.loss.item()
Maybe another high-level question is: Now that I can compute these scores for any interleaved sequences, do you think length normalisation of the log-likelihood scores would be an important factor for comparing across sequences? For example, if I am doing ImageNet evaluation, I would have sequences like <image> This is a photo of a tick vs <image> This is a photo of a Mexican hairless dog (xoloitzcuintli). Would you normalise the log-likelihood scores by the length of the classnames for a fair comparison?
That looks good! Only thing I would change is:
id_ = torch.zeros(visual_embs.shape[:2], dtype=torch.int64).to(visual_embs.device) - 100
This will generalize when visual_embs is not a single vector (one of the checkpoints we released has visual_embs as 4 vectors).
Maybe another high-level question is: Now that I can compute these scores for any interleaved sequences, do you think length normalisation of the log-likelihood scores would be an important factor for comparing across sequences?
In my experience measuring FROMAGe on things like VQA, normalization doesn't seem to be super important, and results appear mostly similar. I think it's mostly an empirical question, so if it's not too difficult I'd just try both.
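For anyone trying both, here is a small sketch of the two scoring variants. It rests on an assumption worth checking against your installed version: as far as I understand the HuggingFace causal LM heads, the returned loss is averaged over the label positions that are not -100, so the negated loss is already a per-token (length-normalised) score, and the summed score is that value multiplied by the number of scored tokens. The function sequence_scores and the log-probabilities below are made up for illustration.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value excluded from the loss


def sequence_scores(token_log_probs: List[float],
                    labels: List[int]) -> Tuple[float, float]:
    """Return (summed, mean) log-likelihood over non-ignored positions.

    token_log_probs[i] is assumed to be the log-probability assigned to
    labels[i]; positions labelled IGNORE_INDEX are skipped.
    """
    kept = [lp for lp, label in zip(token_log_probs, labels)
            if label != IGNORE_INDEX]
    if not kept:
        raise ValueError("No scored positions: every label is ignored.")
    total = sum(kept)
    return total, total / len(kept)


# Made-up values: one ignored image position followed by text tokens.
short_sum, short_mean = sequence_scores(
    [0.0, -1.0, -1.0], [IGNORE_INDEX, 11, 12])
long_sum, long_mean = sequence_scores(
    [0.0, -0.75, -0.75, -0.75, -0.75], [IGNORE_INDEX, 11, 12, 13, 14])
```

With these numbers the summed score favours the shorter sequence while the mean favours the longer one, which is the kind of disagreement the normalisation question is about.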
Also: would you be interested in opening a pull request adding this functionality to models.py? I think it'd be a great addition. No worries if not, and I can also do it myself if it's ok with you. Thanks for looking into this!
Great thanks, opened one here: #14
Hi, is it possible to get the tokenwise log-likelihood scores of different outputs from the model?
The use-case would be something like: Given an interleaved image/text input and a list of output text candidates, we should be able to get a score for each output candidate and then return their ranked list, rather than generating the outputs directly. This would be close to how LLMs are evaluated on MCQ tasks. An example from the T0 paper Page 6 (https://arxiv.org/pdf/2110.08207.pdf):
Is it straightforward to do this with Fromage? I assume with the model forward function at inference (haven't dug into this yet)?