mlfoundations / open_clip

An open source implementation of CLIP.

Dimension mismatch when using Coca for VQA task #516

Open jemmyshin opened 1 year ago

jemmyshin commented 1 year ago

I used the generate endpoint to do a VQA task with the CoCa model, but got this error:

[screenshot of the dimension mismatch error traceback]

It seems that this issue does not happen in beam_search mode but does appear in top_k or top_p mode.

Also, when I change the max_seq_len parameter in generate I get different outputs. For example, max_seq_len = 20 with generation_type = top_p does not raise this error, but max_seq_len = 78 with generation_type = top_p does.

[screenshot of the generated outputs]

Am I using this in the wrong way?
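For context, here is a minimal sketch reconstructing the kind of call described above, following the CoCa Colab setup (the image path and prompt are placeholders, not taken from this issue):

import torch
import open_clip
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="mscoco_finetuned_laion2B-s13B-b90k"
)
im = transform(Image.open("billboard.jpg").convert("RGB")).unsqueeze(0)
text = open_clip.tokenize(["Question: what is the color of this billboard? Answer:"])

with torch.no_grad():
    generated = model.generate(im, text=text, generation_type="top_p", max_seq_len=76)

print(open_clip.decode(generated[0]))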

gpucce commented 1 year ago

Hi @jemmyshin, I think there was an issue similar to this one that was fixed some time ago; any chance you are using an older version? Otherwise this is a bug, and I will check what the issue is.

jemmyshin commented 1 year ago

I used the code from the CoCa Colab, so it should be 2.18.0.

gpucce commented 1 year ago

Hi @jemmyshin, there is indeed a small bug of sorts. However, if I understand correctly, you can probably already do what you want without any changes to the codebase. In the meantime I will open a PR.

The reason a longer max_seq_len throws an error is that the model is trained with a context length of 77, one position of which is reserved for a special token, so using 76 (the default) or less is the way to go. Note that this parameter only affects the context the model uses while generating, not the length of the generation.
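To illustrate the limit described above (a sketch, reusing model, im, and text from the reconstruction earlier in this thread; the exact error depends on the open_clip version):

out = model.generate(im, text=text, generation_type="top_p", max_seq_len=76)  # ok: 76 = 77 - 1 reserved special token
out = model.generate(im, text=text, generation_type="top_p", max_seq_len=78)  # exceeds the context length, dimension mismatch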

If I understand correctly, you are not getting an answer after your prompt; the reason for that is the tokenizer. If you replace text = ... with

import open_clip, torch

text = open_clip.tokenize(["Question: what is the color of this billboard? Answer:"])
text = text[:, :torch.where(text == 0)[1][0] - 1]  # strip the end-of-text token and padding

you should get the answer after the prompt. The issue is that the tokenizer adds padding and an end-of-text token by default; I will make a PR to fix this, but you should be able to try this already. Let me know if it actually works!
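For what it's worth, putting this suggestion together with the generate and decode pattern from the Colab gives something like the following sketch (reusing model and im from the earlier reconstruction; the decode/split post-processing is how the Colab handles output, not part of this reply):

text = open_clip.tokenize(["Question: what is the color of this billboard? Answer:"])
text = text[:, :torch.where(text == 0)[1][0] - 1]  # keep only the real prompt tokens
generated = model.generate(im, text=text, generation_type="top_p", max_seq_len=76)
answer = open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", "")
print(answer)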

jemmyshin commented 1 year ago

Yes, that works for a single sample, but probably not for batch_size > 1, since each question may have a different length. Also, the output somehow concatenates the prompt and the answer:

[screenshot of the generated output containing both the prompt and the answer]

Is there a way to separate them automatically (if input text is not None)?
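One possible workaround, not something the library provides: since the prompt tokens are known, drop them from the generated sequence before decoding (a sketch using the variable names from the earlier snippets; the exact offset may need adjusting depending on how the special tokens are handled):

prompt_len = text.shape[1]                 # prompt tokens after trimming (includes the start-of-text token)
answer_tokens = generated[:, prompt_len:]  # drop the echoed prompt, keep only the continuation
answer = open_clip.decode(answer_tokens[0]).split("<end_of_text>")[0]
print(answer)

For batch_size > 1 with questions of different lengths, the simplest option is probably to loop over the questions one at a time, since the trimming step above assumes a single row.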

LixDemon commented 8 months ago

@jemmyshin Hi, can you share the full VQA code for CoCa? Thanks!