When loading the FashionCLIP model from HF using only the image encoder, like this:
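(Roughly along these lines — a minimal sketch rather than the exact snippet, assuming the `patrickjohncyh/fashion-clip` checkpoint and the standard `transformers` `CLIPVisionModel` / `CLIPImageProcessor` classes.)

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

model_id = "patrickjohncyh/fashion-clip"  # assumed FashionCLIP checkpoint on the Hub

vision_encoder = CLIPVisionModel.from_pretrained(model_id)
processor = CLIPImageProcessor.from_pretrained(model_id)

image = Image.open("some_garment.jpg")  # any RGB image
inputs = processor(images=image, return_tensors="pt")  # resized/cropped to 224x224

with torch.no_grad():
    outputs = vision_encoder(**inputs)

embeddings = outputs.last_hidden_state  # one vector per token (patches + CLS)
print(embeddings.shape)  # -> torch.Size([1, 50, 768])
```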
This outputs an array of shape (1, 50, 768).

Two concerns here:

1. Does this mean that the FashionCLIP image encoder divides the image into 50 patches? Why isn't the original architecture of CLIP kept for FashionCLIP?
2. Also, are these the patch embeddings, or are they the result of projection layers that reduce the dimensionality to 768? I ask because the original CLIP model outputs 1024-dimensional embeddings right after the Vision Transformer encoder.
Thanks