dsikka opened this issue 1 year ago
Hi @dsikka, can I ask which generation type you are using, and how you are padding?
There should be a fixed-length argument that the generator can use; these are the arguments: https://github.com/mlfoundations/open_clip/blob/67e5e5ec8741281eb9b30f640c26f91c666308b7/src/open_clip/coca_model.py#L169-L185
Using `fixed_output_length=True` should give you output of the same length. However, if this is not what you are looking for, if you can explain a bit more how you would like things to work (maybe with an example), I can help you further.
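For concreteness, a minimal sketch of a generation call using that argument (the checkpoint name and image path are placeholders taken from the open_clip README; adjust to whatever CoCa model you are using):

```python
import torch
from PIL import Image
import open_clip

# Minimal sketch; "coca_ViT-L-14" / "mscoco_finetuned_laion2B-s13B-b90k" and "cat.jpg"
# are placeholders -- swap in whatever CoCa checkpoint and image you are working with.
model, _, transform = open_clip.create_model_and_transforms(
    model_name="coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2B-s13B-b90k",
)
model.eval()

im = transform(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    # fixed_output_length=True keeps every generated sequence at seq_len tokens,
    # so all outputs in a batch share one shape.
    generated = model.generate(
        im,
        generation_type="beam_search",
        seq_len=30,
        fixed_output_length=True,
    )

print(open_clip.decode(generated[0]).split("<end_of_text>")[0].replace("<start_of_text>", ""))
```

With `fixed_output_length=True`, the returned token tensor should have shape `(batch_size, seq_len)` regardless of where the end-of-text token lands.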
Hi, thanks for the quick reply @gpucce.
I am currently using beam_search and was referring to the input to the text model in the forward pass: https://github.com/mlfoundations/open_clip/blob/67e5e5ec8741281eb9b30f640c26f91c666308b7/src/open_clip/coca_model.py#L151
The `text` input has variable length as the caption is generated. I wanted to pad this input so that all calls to `self.text()` have an input of the same size on this line: https://github.com/mlfoundations/open_clip/blob/67e5e5ec8741281eb9b30f640c26f91c666308b7/src/open_clip/coca_model.py#L138
Possibly something like this, if we were to pad all inputs to length 15?
```python
import torch.nn.functional as F

og_shape = text.shape[-1]
# left-pad the token ids up to length 15 (F.pad fills with 0 by default, not pad_token_id)
r = F.pad(text, (15 - og_shape, 0))
text_latent, token_emb = self.text(r)
```
I was wondering how to do this correctly while also updating the attn mask: https://github.com/mlfoundations/open_clip/blob/67e5e5ec8741281eb9b30f640c26f91c666308b7/src/open_clip/transformer.py#L604
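For what it's worth, here is a minimal sketch of one way a causal-plus-padding additive mask could be built (this is not the existing open_clip code; `build_padded_attn_mask`, `pad_id`, and `num_heads` are hypothetical names for illustration):

```python
import torch

def build_padded_attn_mask(text, pad_id, num_heads, dtype=torch.float32):
    """Hypothetical helper: combine a causal mask with a key-padding mask.

    text: (batch, seq_len) token ids already padded to a fixed length.
    Returns an additive mask of shape (batch * num_heads, seq_len, seq_len),
    the layout nn.MultiheadAttention accepts for a batched attn_mask.
    """
    bsz, seq_len = text.shape
    device = text.device
    neg_inf = float("-inf")

    # Standard causal mask: -inf strictly above the diagonal, 0 elsewhere.
    causal = torch.full((seq_len, seq_len), neg_inf, dtype=dtype, device=device)
    causal.triu_(1)

    # Key-padding mask: -inf in every column that holds a pad token,
    # so no query can attend to a padded position.
    pad_cols = (text == pad_id).unsqueeze(1)  # (batch, 1, seq_len)
    pad = torch.zeros(bsz, seq_len, seq_len, dtype=dtype, device=device)
    pad.masked_fill_(pad_cols, neg_inf)

    mask = causal.unsqueeze(0) + pad

    # Keep the diagonal unmasked so a fully padded row is never all -inf
    # (an all -inf row would produce NaNs in the attention softmax).
    diag = torch.eye(seq_len, dtype=torch.bool, device=device).unsqueeze(0)
    mask = mask.masked_fill(diag, 0.0)

    return torch.repeat_interleave(mask, num_heads, dim=0)
```

On top of that, `text` itself would need to be padded with the model's actual pad token rather than zeros, and the per-sample mask threaded through to the attention blocks, which, as far as I can tell, `TextTransformer.forward` does not expose directly.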
Hi, just wanted to follow up on this.
@lucidrains @gpucce @iejMac
Hello,
I am trying to run the caption generation workflow and was wondering what I have to do so that the inputs to the TextTransformer model are always padded to a fixed length. Padding the input with the `pad_token_id` results in nonsensical captions. How should the attn mask be updated in both the TextTransformer and the MultiModalDecoder? Currently, the input to the TextTransformer grows as the caption is generated, but I'd like to pad the input to a fixed length.
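For illustration, a minimal sketch of the kind of fixed-length padding described above (`pad_to_fixed_length`, `max_len`, and `pad_token_id` here are hypothetical names, not part of the open_clip API):

```python
import torch
import torch.nn.functional as F

def pad_to_fixed_length(text: torch.Tensor, max_len: int, pad_token_id: int) -> torch.Tensor:
    # Right-pad the (batch, cur_len) token ids with pad_token_id up to max_len,
    # so every call into the text tower sees the same sequence length.
    cur_len = text.shape[-1]
    return F.pad(text, (0, max_len - cur_len), value=pad_token_id)
```

By itself this only changes the tensor shape; the attention layers and the generation loop still treat every position as real text, which is presumably why padding with `pad_token_id` alone produces nonsensical captions, hence the question about updating the attn mask.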
Thanks.
@lucidrains @gpucce @iejMac