mlfoundations / open_clip

An open source implementation of CLIP.

Padding text inputs to `TextTransformer` results in incorrect captions #588

Open dsikka opened 1 year ago

dsikka commented 1 year ago

Hello,

I am trying to run the caption generation workflow and was wondering what I need to change if the inputs to the TextTransformer model are always padded to a fixed length. Padding the input with the pad_token_id results in nonsensical captions.

How should the attention mask be updated in both the TextTransformer and the MultiModalDecoder? Currently, the input to the TextTransformer grows as the caption is generated, but I'd like to pad the input to a fixed length.

Thanks.

@lucidrains @gpucce @iejMac

gpucce commented 1 year ago

hi @dsikka, can I ask which generation type you are using, and how you are padding?

There should be a fixed-length argument that the generator can use; these are the arguments: https://github.com/mlfoundations/open_clip/blob/67e5e5ec8741281eb9b30f640c26f91c666308b7/src/open_clip/coca_model.py#L169-L185

Using `fixed_output_length=True` should give you output of the same length. However, if you can explain a bit more how you would like things to work, perhaps with an example, I can help you further if this is not what you are looking for.
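For reference, the idea behind a fixed-output-length generator can be sketched in plain Python (this is an illustration of the concept, not open_clip's actual code; the token ids and pad id below are made up):

```python
# Sketch: every generated caption is right-padded with pad_token_id up to
# seq_len, so the batch has shape (batch_size, seq_len) regardless of
# where each caption's EOS token fired.
PAD_TOKEN_ID = 0  # assumed pad id for this sketch

def pad_to_fixed_length(token_ids, seq_len, pad_token_id=PAD_TOKEN_ID):
    """Right-pad a generated id list to exactly seq_len tokens."""
    if len(token_ids) > seq_len:
        raise ValueError("sequence longer than seq_len")
    return token_ids + [pad_token_id] * (seq_len - len(token_ids))

# Two captions of different lengths come out with identical shape:
captions = [[49406, 320, 1125, 49407], [49406, 320, 49407]]
fixed = [pad_to_fixed_length(c, 6) for c in captions]
```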

dsikka commented 1 year ago

> hi @dsikka, can I ask which generation type you are using, and how you are padding?
>
> There should be a fixed-length argument that the generator can use; these are the arguments: https://github.com/mlfoundations/open_clip/blob/67e5e5ec8741281eb9b30f640c26f91c666308b7/src/open_clip/coca_model.py#L169-L185
>
> Using `fixed_output_length=True` should give you output of the same length. However, if you can explain a bit more how you would like things to work, perhaps with an example, I can help you further if this is not what you are looking for.

Hi, thanks for the quick reply @gpucce!

I am currently using beam_search and was referring to the input to the text model in the forward pass: https://github.com/mlfoundations/open_clip/blob/67e5e5ec8741281eb9b30f640c26f91c666308b7/src/open_clip/coca_model.py#L151

The text input has variable length as the caption is generated. I wanted to pad this input such that all calls to self.text() have an input of the same size on this line: https://github.com/mlfoundations/open_clip/blob/67e5e5ec8741281eb9b30f640c26f91c666308b7/src/open_clip/coca_model.py#L138

Possibly something like this, if we were to pad all inputs to length 15?

```python
og_shape = text.shape[-1]
r = F.pad(text, (15 - og_shape, 0))  # left-pad up to a fixed length of 15
text_latent, token_emb = self.text(r)
```

I was wondering how to do this correctly while also correctly updating the attn mask: https://github.com/mlfoundations/open_clip/blob/67e5e5ec8741281eb9b30f640c26f91c666308b7/src/open_clip/transformer.py#L604
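One way to think about the mask update for a left-padded input (a plain-Python sketch under my own assumptions, not open_clip's implementation): the mask stays causal, and the columns for the leading pad positions are additionally set to -inf so real tokens never attend to padding.

```python
NEG_INF = float("-inf")

def build_attn_mask(seq_len, num_pad):
    """Additive attention mask for a left-padded causal sequence.

    Entry [i][j] is 0.0 where query position i may attend key position j,
    and -inf otherwise. Combines the causal constraint (j <= i) with a
    padding constraint (j >= num_pad, since positions 0..num_pad-1 are
    padding). Each position may attend itself (j == i) so that no row is
    entirely -inf, which would make the softmax produce NaNs; rows for
    pad queries are typically discarded downstream anyway.
    """
    return [[0.0 if (j == i or (num_pad <= j <= i)) else NEG_INF
             for j in range(seq_len)]
            for i in range(seq_len)]

m = build_attn_mask(5, 2)  # length-5 input with 2 leading pad tokens
```

With this sketch, the last real token (row 4) attends only positions 2..4, and the pad rows collapse to self-attention only.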

dsikka commented 1 year ago

Hi, just wanted to follow up on this.

@lucidrains @gpucce @iejMac

dsikka commented 1 year ago

@lucidrains @gpucce @iejMac