Unnormalized image and text projected vectors

moein-shariatnia / OpenAI-CLIP

Simple implementation of OpenAI CLIP model in PyTorch.

MIT License

Hi, thanks for open-sourcing your code. I noticed that the text and image vectors used to compute the logits are not unit-normalized: https://github.com/moein-shariatnia/OpenAI-CLIP/blob/e2c5bb3859d7478752af8c69862f63b1afe4a9cb/modules.py#L68

In this case, the two vectors can have arbitrary lengths, so their dot product is not their cosine similarity, which is what OpenAI's CLIP implementation computes. Do you have any intuition for why you chose LayerNorm over L2 normalization?
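
To make the difference concrete, here is a minimal sketch. It does not use the repository's code: the tensor names, batch size, and embedding dimension are made up for illustration, and the LayerNorm is applied without learnable affine parameters. It compares raw dot-product logits, L2-normalized (cosine) logits, and the row norms that LayerNorm produces.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the projected embeddings (batch of 4, dim 256).
# These names and shapes are assumptions, not taken from the repository.
dim = 256
image_embeddings = torch.randn(4, dim)
text_embeddings = torch.randn(4, dim) * 5.0  # deliberately larger norms

# Raw dot products: the scale depends on the lengths of the vectors.
raw_logits = text_embeddings @ image_embeddings.T

# L2-normalize each row to unit length, so the dot product
# becomes the cosine similarity and is bounded in [-1, 1].
image_unit = F.normalize(image_embeddings, p=2, dim=-1)
text_unit = F.normalize(text_embeddings, p=2, dim=-1)
cosine_logits = text_unit @ image_unit.T

# LayerNorm without affine parameters gives each row zero mean and unit
# variance, so each row has norm close to sqrt(dim), not 1.
text_ln = F.layer_norm(text_embeddings, (dim,))

print("max |raw logit|:   ", raw_logits.abs().max().item())
print("max |cosine logit|:", cosine_logits.abs().max().item())
print("row norms after L2:       ", text_unit.norm(dim=-1))
print("row norms after LayerNorm:", text_ln.norm(dim=-1))
```

With a trainable LayerNorm (scale and bias), the row norms are no longer fixed at all, so the logits are even further from a cosine similarity.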

Unnormalized image and text projected vectors #10