Hello!
The output shape generally depends on the model you are using. CLIP and its fine-tuned variants have a final projection layer at the end of each of the two modality-specific backbones, which gives you a single vector per sentence.
That is why, given a sentence, you get a [1, 512] output.
Our implementation follows the one HuggingFace provides, so it should be consistent with most CLIPs built in the same way.
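As a minimal sketch of that HuggingFace-style path (the checkpoint name below is just an example, not necessarily the exact model used here):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

inputs = tokenizer(["a photo of a cat"], return_tensors="pt")
print(inputs["input_ids"].shape)  # [1, length] integer token ids

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # [1, length, dim] per-token features
print(outputs.text_embeds.shape)        # [1, 512] projected sentence vector
```

So both shapes exist in the same forward pass: [batch, length, dim] is the per-token hidden state before pooling, and [batch, dim] is what you get after taking the EOS-token representation through the projection layer.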
Not sure if this answers your question but happy to add more to this!
Thanks very much!
Hello, I saw that other versions of CLIP convert a sentence to a numeric encoding when using the text encoder, with shape [batch, length], where length is the number of tokens in the encoding. For example, the code I used converts a sentence to an encoding of length 13. How does your model consume text input? I saw that the final text feature in other models has shape [batch, length, dim], while your model's output is [batch, dim].
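(For reference, the [batch, length] encoding described above is the tokenizer output, i.e. token ids rather than features; a minimal sketch assuming the HuggingFace tokenizer and an example checkpoint name:)

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
ids = tokenizer(["a photo of a cat"], return_tensors="pt")["input_ids"]
print(ids.shape)  # [1, length] token ids, not yet embeddings
```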