Open claudiogreco opened 2 years ago
logits = coca(
    text = text,
    images = images
) # (4, 512, 20000)
I have the same question. The code above does return the caption logits, but it needs text as input. At inference time only the image tokens are available, so there are no text tokens to pass in.
Thank you in advance.
Same problem here. From the logits I get a huge tensor, but I haven't figured out how to convert it to text.
Hello,
Thank you for implementing this model. Have you already written code to generate a caption for a given image? If not, do you have an idea of how you would do it with this particular architecture?
Thank you in advance.
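One common way to turn per-position logits into a caption is greedy autoregressive decoding: start from a begin-of-sequence token, ask the model for the scores of the next token, append the highest-scoring one, and repeat until an end-of-sequence token appears. The sketch below is not code from this repository. It is written in plain Python against a hypothetical `next_token_logits` callable, and `bos_id`, `eos_id`, `VOCAB`, and the toy scorer are all assumptions made for illustration. With the real model, that callable would presumably run `coca(text = tokens_so_far, images = images)` and return the logits at the last text position, assuming the text decoder is causal.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    bos_id: int,
    eos_id: int,
    max_len: int = 32,
) -> List[int]:
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    tokens = [bos_id]
    for _ in range(max_len):
        # Scores over the vocabulary for the token that follows `tokens`.
        logits = next_token_logits(tokens)
        # Argmax over the vocabulary dimension.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the begin-of-sequence token


def ids_to_text(ids: Sequence[int], vocab: Sequence[str]) -> str:
    """Map token ids back to strings (a real tokenizer would do this)."""
    return " ".join(vocab[i] for i in ids)


# --- Toy stand-in for the model (hypothetical, for illustration only) ---
VOCAB = ["<bos>", "<eos>", "a", "dog", "runs"]
SCRIPT = [2, 3, 4, 1]  # the toy "model" always wants: a dog runs <eos>


def toy_next_token_logits(tokens: List[int]) -> List[float]:
    # Pick the scripted token for the current step and give it the top score.
    step = min(len(tokens) - 1, len(SCRIPT) - 1)
    logits = [0.0] * len(VOCAB)
    logits[SCRIPT[step]] = 1.0
    return logits


caption_ids = greedy_decode(toy_next_token_logits, bos_id=0, eos_id=1)
print(ids_to_text(caption_ids, VOCAB))
```

Greedy decoding is the simplest choice here; sampling or beam search would replace only the argmax step. The token ids still have to be mapped back to strings with whatever tokenizer produced the training text, which the thread does not specify.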