patrickjohncyh / fashion-clip

FashionCLIP is a CLIP-like model fine-tuned for the fashion domain.
MIT License

How does the text input change in the model? #13

Closed Originlightwkp closed 1 year ago

Originlightwkp commented 1 year ago

Hello. I saw that other versions of CLIP convert a sentence into a unique numeric encoding when using the text encoder, with shape [batch, length], where length is the number of tokens in the encoding. For example, the code below converts a sentence into an encoding of length 13. How does your model use the text input? In other models the final text feature has shape [batch, length, dim], while your model's output is [batch, dim].

import torch
import numpy as np
from PIL import Image
from transformers import AutoTokenizer, CLIPProcessor, CLIPModel

np.set_printoptions(precision=6, suppress=True)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Local FashionCLIP checkpoint saved in HuggingFace format
mypath = "/media/cheng/dataset4/codeto2022/fashion-clip-master/modelsave"
model = CLIPModel.from_pretrained(mypath).to(device)
processor = CLIPProcessor.from_pretrained(mypath)
tokenizer = AutoTokenizer.from_pretrained(mypath)

urls = ['/media/cheng/dataset4/codeto2022/fashion-clip-master/images/16790484.jpg',
        '/media/cheng/dataset4/codeto2022/fashion-clip-master/images/16198646.jpg',
        '/media/cheng/dataset4/codeto2022/fashion-clip-master/images/nike_dress.jpg']
images = [Image.open(i) for i in urls]

# Tokenize the caption: the result contains input_ids of shape [batch, length]
input1 = tokenizer(["a photo of a Women wearing white backless long dresses"],
                   padding=True, return_tensors="pt").to(device)
print(input1)  # input_ids: [[49406, 320, 1125, 539, 320, 1507, 3309, 1579, 1663, 1285, 1538, 8184, 49407]]

# Preprocess the images: pixel_values of shape [batch, 3, H, W]
input2 = processor(images=images, return_tensors="pt").to(device)

# The encodings are dicts, so unpack them with ** before passing them to the model
text_features = model.get_text_features(**input1)
image_features = model.get_image_features(**input2)

print(text_features.shape)   # [1, 512]
print(image_features.shape)  # [3, 512]

vinid commented 1 year ago

Hello!

The output shape generally depends on the model you are using. CLIP and its fine-tuned versions have a final projection layer at the end of each of the two modality-specific backbones, which gives you a single vector per sentence (and per image).

That is why, given one sentence, you get a [1, 512] tensor.
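For illustration, here is a minimal sketch (assuming the patrickjohncyh/fashion-clip checkpoint on the HuggingFace Hub and the standard CLIPModel API) of where each shape comes from: the text backbone still exposes token-level hidden states of shape [batch, length, hidden_size], while get_text_features returns the pooled vector after the projection layer, of shape [batch, projection_dim].

import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
tokenizer = CLIPTokenizer.from_pretrained("patrickjohncyh/fashion-clip")

inputs = tokenizer(["a photo of a red dress"], padding=True, return_tensors="pt")

with torch.no_grad():
    # Token-level hidden states from the text backbone: [batch, length, hidden_size]
    token_states = model.text_model(**inputs).last_hidden_state
    # Pooled representation passed through the projection layer: [batch, projection_dim]
    text_features = model.get_text_features(**inputs)

print(token_states.shape)   # [1, length, hidden_size]
print(text_features.shape)  # [1, 512]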

Our implementation follows the one HuggingFace provides, so it should be consistent with most CLIP models built in the same way.
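As a usage sketch of how those projected [batch, dim] vectors are typically consumed (again assuming the standard HuggingFace CLIPProcessor/CLIPModel API; the image file names below are placeholders), calling the model on text and images together returns scaled cosine similarities between the two projected embeddings:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

# Placeholder image files
images = [Image.open("dress.jpg"), Image.open("shoes.jpg")]
texts = ["a white backless long dress", "a pair of running shoes"]

inputs = processor(text=texts, images=images, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: [num_images, num_texts], cosine similarity scaled by the learned logit_scale
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)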

Not sure if this answers your question but happy to add more to this!

Originlightwkp commented 1 year ago

Thank you very much!