openai / CLIP

CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image
MIT License
25.8k stars · 3.3k forks

CLIP Training Code #83

Open vinson2233 opened 3 years ago

vinson2233 commented 3 years ago

Not really an issue, I just want to share my training code since some people still have some difficulty writing the training code. Just modify the code to suit your usage. Feel free to ask or point out any mistakes in my code.

# Latest Update : 18 July 2022, 09:55 GMT+7

# TO ADD :
# Gradient Checkpointing
# Filter out bias from weight decay
# Decaying learning rate with cosine schedule
# Half-precision Adam statistics
# Half-precision stochastically rounded text encoder weights were used

# BATCH_SIZE must be larger than 1

device = "cuda:0" if torch.cuda.is_available() else "cpu" # If using GPU then use mixed precision training.
model, preprocess = clip.load("ViT-B/32",device=device,jit=False) #Must set jit=False for training

class image_title_dataset(Dataset):
    def __init__(self, list_image_path,list_txt):

        self.image_path = list_image_path
        self.title  = clip.tokenize(list_txt) #you can tokenize everything at once in here(slow at the beginning), or tokenize it in the training loop.

    def __len__(self):
        return len(self.title)

    def __getitem__(self, idx):
        image = preprocess(Image.open(self.image_path[idx])) # Image from PIL module
        title = self.title[idx]
        return image,title

# use your own data
list_image_path = ['folder/image1.jpg','folder2/image2.jpg'] 
list_txt = ['description for image1.jpg' , 'description for image2.jpg']
dataset = image_title_dataset(list_image_path,list_txt)
train_dataloader = DataLoader(dataset,batch_size = BATCH_SIZE) #Define your own dataloader

#https://github.com/openai/CLIP/issues/57
def convert_models_to_fp32(model): 
    for p in model.parameters(): 
        p.data = p.data.float() 
        p.grad.data = p.grad.data.float() 

if device == "cpu":
  model.float()
else :
  clip.model.convert_weights(model) # Actually this line is unnecessary since CLIP is already in float16 by default

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-5,betas=(0.9,0.98),eps=1e-6,weight_decay=0.2) # Params from the paper; the lr is smaller, which is safer for fine-tuning on a new dataset

# add your own code to track the training progress.
for epoch in range(EPOCH):
  for batch in train_dataloader :
      optimizer.zero_grad()

      images,texts = batch 

      images= images.to(device)
      texts = texts.to(device)

      logits_per_image, logits_per_text = model(images, texts)

      ground_truth = torch.arange(len(images),dtype=torch.long,device=device)

      total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2
      total_loss.backward()
      if device == "cpu":
         optimizer.step()
      else : 
        convert_models_to_fp32(model)
        optimizer.step()
        clip.model.convert_weights(model)
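
On the "Filter out bias from weight decay" item in the TO ADD list above, a rough, untested sketch of one way it could be done with optimizer parameter groups is below; the rule of excluding biases and other 1-D parameters (LayerNorm weights, logit_scale) is an assumption, not something from the original snippet.

decay, no_decay = [], []
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if param.ndim < 2 or name.endswith(".bias"):  # biases, LayerNorm weights, logit_scale
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = optim.Adam(
    [{"params": decay, "weight_decay": 0.2},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=5e-5, betas=(0.9, 0.98), eps=1e-6)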

Code to save the model :

torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': total_loss,
        }, f"model_checkpoint/model_10.pt") #just change to your preferred folder/filename

Code to load the saved model :

model, preprocess = clip.load("ViT-B/32",device=device,jit=False) #Must set jit=False for training
checkpoint = torch.load("model_checkpoint/model_10.pt")

# Use these 3 lines if you used the default model settings (not a custom training setting) for CLIP. For example, if you set context_length to 100 because your strings are very long during training, then assign 100 to checkpoint['model_state_dict']["context_length"] instead
checkpoint['model_state_dict']["input_resolution"] = model.input_resolution #default is 224
checkpoint['model_state_dict']["context_length"] = model.context_length # default is 77
checkpoint['model_state_dict']["vocab_size"] = model.vocab_size 

model.load_state_dict(checkpoint['model_state_dict'])

Alternative training code :

vinson2233 commented 3 years ago

@minmummax this is the base code I'm using for a private dataset consisting of more than 1 million image-text pairs. The actual code I'm using involves multi-GPU training.

vinson2233 commented 3 years ago

@liuhui0401 in my case, it starts at 3 but falls quickly to 1.5, and after 8000 steps (not epochs) the loss oscillates around 0.5.

lr19960813 commented 3 years ago

Hi vinson, thank you for your code. Have you tried using learning rate decay? I think it would give a better result.

vinson2233 commented 3 years ago

@lr19960813 no, I haven't tried it yet. The learning rate decay they use is this one, right? https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.html?highlight=cosine#torch.optim.lr_scheduler.CosineAnnealingWarmRestarts

Maybe I'll dive deeper on this.
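
For reference, a minimal sketch of how such a scheduler could be attached to the optimizer from the snippet above; the cycle length T_0=1000 and eta_min are just placeholders, not recommended values.

from torch.optim import lr_scheduler

scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=1000, eta_min=1e-6)

for epoch in range(EPOCH):
    for batch in train_dataloader:
        # ... the same forward/backward/optimizer.step() as in the main snippet ...
        scheduler.step()  # advance the cosine schedule once per optimizer step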

lr19960813 commented 3 years ago

@vinson2233 Yes, that's one type of lr scheduler. In my experiments, I used a roam scheduler and got a better result in a small test. I will do more experiments to research the performance.

sarahESL commented 3 years ago

Regarding the long text: yes, the CLIP text encoder only accepts a token length of 77 (1 token represents roughly 1-3 characters). It's better to trim your text first.

Isn't it possible to use shorter sequence lengths by adjusting context_length in tokenize? I tried it, but when encoding the text it still gives RuntimeError: The size of tensor a (_X_) must match the size of tensor b (77) at non-singleton dimension 1!

lr19960813 commented 3 years ago

Yes, so I think you should trim your text to shorter than 77 tokens.


vinson2233 commented 3 years ago

@sarahESL yes you can, but you cannot use the pre-trained CLIP if you want to use a context_length different from 77. You might need to define CLIP using this function over here https://github.com/openai/CLIP/blob/3b473b0e682c091a9e53623eebc1ca1657385717/clip/model.py#L239-L240 with fresh weights.

After that, you can use tokenize with a context_length longer than 77 and expect the CLIP model to be able to accept it.
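
As a rough sketch of what that could look like (untested; the hyperparameters below are meant to mirror ViT-B/32 and should be double-checked against clip/model.py, and all weights start out randomly initialized):

import clip
from clip.model import CLIP

model = CLIP(
    embed_dim=512,
    # vision
    image_resolution=224,
    vision_layers=12,
    vision_width=768,
    vision_patch_size=32,
    # text
    context_length=100,   # raised from the default 77
    vocab_size=49408,
    transformer_width=512,
    transformer_heads=8,
    transformer_layers=12,
)

texts = clip.tokenize(["a very long caption ..."], context_length=100)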

jongwook commented 3 years ago

@HiGal The two losses are different because the softmax operations are done over different axes.

nbl97 commented 3 years ago

@vinson2233 Thanks for your multi-GPU code. However, I don't understand why the similarity matrix is 100x10 in your example if we don't separate the CLIP model. Could you pls explain more? If I want to use DDP, is it necessary to separate the model? Thanks a lot!

nbl97 commented 3 years ago

@vinson2233 A supplement to the above question: I read the official code and found that the forward() function produces the cosine similarity, not the embeddings. If I modify forward() to produce the image and text embeddings and calculate the similarity outside forward(), is it still necessary to separate the CLIP model when multiple GPUs are used?

vinson2233 commented 3 years ago

@nbl97 Let's understand what DataParallel does. Suppose we have 10 records to run inference on and we have 2 GPUs. DataParallel creates a duplicate of the model on each GPU and then spreads the 10 records across the devices, something like this: GPU0: records 0,1,2,3,4; GPU1: records 5,6,7,8,9. After that, each device runs inference and the outputs are concatenated.

But for CLIP, this becomes a bit problematic. On a single device, if we give the model 10 text-image pairs, the forward function produces a 10x10 similarity matrix. But if you work on 2 GPUs, the first GPU receives the first 5 pairs and the second GPU receives the last 5 pairs. What happens on the first GPU? It calculates the cosine similarity among its 5 pairs, and so does the second GPU. The result is a 5x5 cosine similarity matrix on each GPU. The last step gathers the predictions from each GPU, so we end up with a 10x5 similarity matrix in this case.

Regarding DDP, I have never used it, so I cannot give an answer on that 🙏

Regarding the modification, yes, it should be possible in theory. I have tried it in the past, but the parallelization only happens in the forward method; any other function does not get hooked into the parallelization. That's why I came up with the idea of splitting the model based on its responsibilities.
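
A rough sketch of that splitting idea (untested): wrap CLIP so that the parallelized forward returns embeddings, then compute the full similarity matrix on one device afterwards. The wrapper class below is made up for illustration and reuses model, images and texts from the training snippet above.

import torch.nn as nn

class ClipEmbedder(nn.Module):
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, images, texts):
        # only this forward gets replicated and split by DataParallel
        return self.clip_model.encode_image(images), self.clip_model.encode_text(texts)

embedder = nn.DataParallel(ClipEmbedder(model))

# inside the training loop:
image_features, text_features = embedder(images, texts)               # gathered on the default GPU
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
logit_scale = model.logit_scale.exp()
logits_per_image = logit_scale * image_features @ text_features.t()   # full NxN matrix
logits_per_text = logits_per_image.t()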

xyxxmb commented 3 years ago

Has anyone tried fine-tuning CLIP on the MS COCO image-text retrieval task? How is the performance compared with other state-of-the-art models?

Everything gets worse when I fine-tune CLIP.

From my experiment, the zero-shot image retrieval performance is R@1 25.4, R@5 48.7 and R@10 59.9 on the MS COCO 5k test set. After fine-tuning, it slightly improves to R@1 33.6, R@5 62.2 and R@10 73.8. It still lags behind SOTA non-transformer-based models (e.g., VSRN).

How did you get the image retrieval result on MS COCO? Could you share your test code?

milmin commented 2 years ago

Hi @vinson2233, thanks a lot for sharing your code! If I understand correctly, you never put the model into training mode with model.train(). Do you do that on purpose to freeze the dropout and batchnorm layers during fine-tuning? And a second question: do you have any ideas about good metrics to track during evaluation to tell whether fine-tuning is going well (apart from the validation loss)?

Wuziyi616 commented 2 years ago

Hi @vinson2233, thank you so much for sharing the code. I'm just wondering what batch_size you are using to fine-tune CLIP. I'm using DDP training with 4 GPUs, and the batch_size is 128 on each GPU. However, the loss isn't decreasing as expected... So I wonder whether that's because my batch_size is too small (so the fine-tuning is unstable) or there are bugs in my DDP code. Thanks in advance!

akamil-etsy commented 2 years ago

Is it possible to fine-tune a pre-trained CLIP model using a loss function different from the original (InfoNCE)?

I found an example on Kaggle where CLIP is trained from scratch with a triplet loss using pair-wise comparisons (vs. class labels), and I'm wondering if this approach would work for fine-tuning the original pre-trained model from OpenAI.

I.e., freeze the lower layers of the model pre-trained with InfoNCE and train only a few higher layers with a triplet loss on custom data (triplets of images and/or captions), or a circle loss, instead of the original contrastive loss/InfoNCE.

xyxxmb commented 2 years ago

@uplusv If you want to modify CLIP as a classifier (single label, multi-class), here are some modifications you can make:

  1. Change ground_truth = torch.arange(BATCH_SIZE).to(device) to an integer vector that specifies which class each image belongs to (for example torch.tensor([0,1,2,1,2,3,4,5])). With this you can now set your batch size to an arbitrary value.
  2. One image should match 1 label, but 1 label can match multiple images. You can omit loss_txt, changing total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2 to total_loss = loss_img(logits_per_image,ground_truth)

I'm not sure what you meant by "After fine-tuning, the model outputs sample feature for every image"

I don't think the classifier idea (1) is correct. Assume three categories and five images, with ground_truth = [0,0,1,2,1]. logits_per_image would be a matrix A with shape (5,5), where A(i, j) denotes the similarity of image i and category j. When calculating F.cross_entropy(logits_per_image, [0,0,1,2,1]), the nll_loss of the 3rd, 4th and 5th rows is not correct.
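
For illustration, one way to make the classifier idea concrete while avoiding the shape confusion above is to tokenize one prompt per class, so logits_per_image has shape (batch_size, num_classes). This is only a sketch that reuses model, images and device from the training snippet; the prompts and label values are made-up placeholders.

# one text prompt per class (placeholders), reused for every batch
class_prompts = clip.tokenize(["a photo of a cat",
                               "a photo of a dog",
                               "a photo of a bird"]).to(device)

# inside the training loop: images is a batch of 5 images whose classes are 0, 0, 1, 2, 1
labels = torch.tensor([0, 0, 1, 2, 1], device=device)
logits_per_image, _ = model(images, class_prompts)   # shape (5, 3): each image vs. each class prompt
loss = nn.CrossEntropyLoss()(logits_per_image, labels)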

BrianG13 commented 2 years ago

@xyxxmb So, what is the best thing to do when the same batch contains images with the same category? For example, batch_size=5 and the labels/text features are [0,0,1,2,1]. ground_truth = torch.arange(BATCH_SIZE).to(device) is not the best solution either, because for the image entries at idx 0 and 1, I want them to be consistent with themselves (the diagonal) and with the other images of the same class.

Unmesh28 commented 2 years ago

I want to use CLIP for text-to-art using VQGAN, and I'm planning to provide a custom dataset for it [dataset: different painting styles of different artists]. If we send a prompt + style = "A girl playing on the beach" + "Thomas Kinkade", then it should return art of the prompt in the style of Thomas Kinkade. How do I train such a model on top of the existing one?

dongyun-kim-arch commented 2 years ago

Thank you for sharing your training code! I am wondering how I can fine-tune with all layers frozen except the top layer, like a linear probe?
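
A rough sketch of one way to do that (untested; which parameters count as "the top layer" is an assumption here, and model.visual.proj exists on the ViT variants):

# freeze everything
for p in model.parameters():
    p.requires_grad = False

# unfreeze only the final projections and the temperature
for p in [model.visual.proj, model.text_projection, model.logit_scale]:
    p.requires_grad = True

optimizer = optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-4)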

Zasder3 commented 2 years ago

As a late addendum to this, open_clip recently had some large success in replicating OpenAI's ViT-B/32 model, achieving near-identical accuracy on the public LAION-400M dataset.

[image: open_clip replication results compared with OpenAI's ViT-B/32]


AhmedEwis commented 2 years ago

Hello my friend. I am getting ValueError: Expected input batch_size (111) to match target batch_size (32). Any idea why I am getting this error?

def convert_models_to_fp32(model):
    for p in model.parameters():
        p.data = p.data.float()
        p.grad.data = p.grad.data.float()

device = "cuda:0" if torch.cuda.is_available() else "cpu" # If using GPU then use mixed precision training.
model, preprocess = clip.load("ViT-B/32",device=device,jit=False) #Must set jit=False for training
if device == "cpu":
  model.float()
else :
  clip.model.convert_weights(model) # Actually this line is unnecessary since CLIP is already in float16 by default

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5,betas=(0.9,0.98),eps=1e-6,weight_decay=0.2) #Params from the paper; the lr is smaller, safer for fine-tuning on a new dataset

for epoch in range(EPOCH):
  for batch in train_dataloader :
      optimizer.zero_grad()

      list_image,list_txt = np_list, txt_list #list_image is a list of images as numpy arrays (np.uint8), or a list of PIL images

      images= torch.stack([preprocess(Image.fromarray(img)) for img in list_image],dim=0).to(device) # omit Image.fromarray if the images are already in PIL format; change this line to images=list_image if using preprocess inside the dataset class
      texts = clip.tokenize(list_txt).to(device)

      logits_per_image, logits_per_text = model(images, texts)

      ground_truth = torch.arange(BATCH_SIZE,dtype=torch.long,device=device)

      total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2
      total_loss.backward()
      if device == "cpu":
         optimizer.step()
      else :
        convert_models_to_fp32(model)
        optimizer.step()
        clip.model.convert_weights(model)

Knowing the following: np_list.shape --> (111, 512, 512, 3) and len(txt_list) --> 111.

BelhalK commented 2 years ago
clip.load("ViT-B/32",device=device,jit=False)

Good morning,

Is this chunk of code used for fine-tuning or training? If it's for fine-tuning, I guess you start from a pretrained ViT. Hence, where do you introduce the new parameters to be learnt on your private dataset? I don't understand what parameters are updated using your training script, could you help me figure it out?

Maybe clip.load("ViT-B/32",device=device,jit=False) loads pretrained model weights and adds some new parameters to fine-tune under the hood?

Thank you

vinson2233 commented 2 years ago

@BelhalK No, I don't introduce new parameters to make it learn from the new dataset. I use the original architecture, and I don't freeze the weights at all, so all the weights inside CLIP (both the image encoder and the text encoder) are adjusted to the images and text from my dataset.

BelhalK commented 2 years ago

@BelhalK No, I don't introduce new parameters to make it learn from the new dataset. I use the original architecture, and I don't freeze the weights at all, so all the weights inside CLIP (both the image encoder and the text encoder) are adjusted to the images and text from my dataset.

Thanks for the reply! So I guess both the text and image models' learnt weights come from training a single model, do you confirm? Thank you

ElifCerenGokYildirim commented 2 years ago

Hi, I am using exactly the same code to train on the CIFAR10 dataset. I defined transform=preprocess while loading the dataset, and I defined images=list_image and texts = clip.tokenize(list_txt).to(device). However, when I run the code I get this error: TypeError: len() of a 0-d tensor. My list_txt is equal to tensor([6, 9, 9, 4, 1, 1, 2, 7, 8, 3, 4, 7, 7, 2, 9, 9, 9, 3, 2, 6, 4, 3, 6, 6, 2, 6, 3, 5, 4, 0, 0, 9]), so I don't understand what the problem is. Thanks in advance :)

AhmedEwis commented 2 years ago

Hello all I have been using the above code and I have already got my weight and biases model.pt for my custom dataset. I am trying now to implement my custom model to the following repo: https://github.com/rmokady/CLIP_prefix_caption but I am getting the below error: RuntimeError: Error(s) in loading state_dict for CLIP: Missing key(s) in state_dict: "positional_embedding", "text_projection", "logit_scale", "visual.class_embedding", "visual.positional_embedding", "visual.proj", "visual.conv1.weight", "visual.ln_pre.weight", "visual.ln_pre.bias", "visual.transformer.resblocks.0.attn.in_proj_weight", "visual.transformer.resblocks.0.attn.in_proj_bias", "visual.transformer.resblocks.0.attn.out_proj.weight", "visual.transformer.resblocks.0.attn.out_proj.bias", "visual.transformer.resblocks.0.ln_1.weight", "visual.transformer.resblocks.0.ln_1.bias", "visual.transformer.resblocks.0.mlp.c_fc.weight", "visual.transformer.resblocks.0.mlp.c_fc.bias", "visual.transformer.resblocks.0.mlp.c_proj.weight", "visual.transformer.resblocks.0.mlp.c_proj.bias", "visual.transformer.resblocks.0.ln_2.weight", "visual.transformer.resblocks.0.ln_2.bias", "visual.transformer.resblocks.1.attn.in_proj_weight", "visual.transformer.resblocks.1.attn.in_proj_bias", "visual.transformer.resblocks.1.attn.out_proj.weight", "visual.transformer.resblocks.1.attn.out_proj.bias", "visual.transformer.resblocks.1.ln_1.weight", "visual.transformer.resblocks.1.ln_1.bias", "visual.transformer.resblocks.1.mlp.c_fc.weight", "visual.transformer.resblocks.1.mlp.c_fc.bias", "visual.transformer.resblocks.1.mlp.c_proj.weight", "visual.transformer.resblocks.1.mlp.c_proj.bias", "visual.transformer.resblocks.1.ln_2.weight", "visual.transformer.resblocks.1.ln_2.bias", "visual.transformer.resblocks.2.attn.in_proj_weight", "visual.transformer.resbl... Unexpected key(s) in state_dict: "epoch", "model_state_dict", "optimizer_state_dict", "loss".

AhmedEwis commented 2 years ago

Hello all I have been using the above code and I have already got my weight and biases model.pt for my custom dataset. I am trying now to implement my custom model to the following repo: https://github.com/rmokady/CLIP_prefix_caption but I am getting the below error: RuntimeError: Error(s) in loading state_dict for CLIP: Missing key(s) in state_dict: "positional_embedding", "text_projection", "logit_scale", "visual.class_embedding", "visual.positional_embedding", "visual.proj", "visual.conv1.weight", "visual.ln_pre.weight", "visual.ln_pre.bias", "visual.transformer.resblocks.0.attn.in_proj_weight", "visual.transformer.resblocks.0.attn.in_proj_bias", "visual.transformer.resblocks.0.attn.out_proj.weight", "visual.transformer.resblocks.0.attn.out_proj.bias", "visual.transformer.resblocks.0.ln_1.weight", "visual.transformer.resblocks.0.ln_1.bias", "visual.transformer.resblocks.0.mlp.c_fc.weight", "visual.transformer.resblocks.0.mlp.c_fc.bias", "visual.transformer.resblocks.0.mlp.c_proj.weight", "visual.transformer.resblocks.0.mlp.c_proj.bias", "visual.transformer.resblocks.0.ln_2.weight", "visual.transformer.resblocks.0.ln_2.bias", "visual.transformer.resblocks.1.attn.in_proj_weight", "visual.transformer.resblocks.1.attn.in_proj_bias", "visual.transformer.resblocks.1.attn.out_proj.weight", "visual.transformer.resblocks.1.attn.out_proj.bias", "visual.transformer.resblocks.1.ln_1.weight", "visual.transformer.resblocks.1.ln_1.bias", "visual.transformer.resblocks.1.mlp.c_fc.weight", "visual.transformer.resblocks.1.mlp.c_fc.bias", "visual.transformer.resblocks.1.mlp.c_proj.weight", "visual.transformer.resblocks.1.mlp.c_proj.bias", "visual.transformer.resblocks.1.ln_2.weight", "visual.transformer.resblocks.1.ln_2.bias", "visual.transformer.resblocks.2.attn.in_proj_weight", "visual.transformer.resbl... Unexpected key(s) in state_dict: "epoch", "model_state_dict", "optimizer_state_dict", "loss".

Issue fixed by loading only the model state dictionary from the checkpoint:

model.load_state_dict(checkpoint['model_state_dict'])

AhmedEwis commented 2 years ago

I am trying to implement a custom CLIP:

model = clip_model.eval()
device = CUDA(0) if is_gpu else "cpu"
model = clip_model.to(device)

use_beam_search = False

import pandas as pd
prefix_length = 10

df = pd.DataFrame([], columns=['image','desc'])
for fname in [i for i in os.listdir('image') if i.endswith('.png')]:
    image = io.imread('image/'+fname)
    pil_image = PIL.Image.fromarray(image)
    image = preprocess(pil_image).unsqueeze(0).to(device)

    with torch.no_grad():
        prefix = model.encode_image(image).to(device, dtype=torch.float32)
        prefix_embed = model.clip_project(prefix).reshape(1, prefix_length, -1)

    if use_beam_search:
        generated_text_prefix = generate_beam(model, tokenizer, embed=prefix_embed)[0]
    else:
        generated_text_prefix = generate2(model, tokenizer, embed=prefix_embed)

    df.loc[len(df)] = [fname, generated_text_prefix]
    print(generated_text_prefix)

I am getting an error on this line: prefix_embed = model.clip_project(prefix).reshape(1, prefix_length, -1)

ModuleAttributeError: 'CLIP' object has no attribute 'clip_project'

kxkaixin commented 2 years ago

Hello, thank you for your work. May I ask you some questions? My code has several lines like this:

model, preprocess = clip.load("RN50", device=device)
for img, text, lab, id in train_loader:
    with torch.no_grad():
        image_features = model.encode_image(img)
        text_features = model.encode_text(text)

  1. For both image_features = model.encode_image(img) and text_features = model.encode_text(text), the data dimension is [batch_size, emb_dim]. If I want to get a richer data dimension, how can I get it?
  2. When I debug at image_features = model.encode_image(img), I want to know how the image is encoded, but when execution reaches def encode_image(self, image): return self.visual(image.type(self.dtype)), the self.visual() network that processes the image does not step into the expected network; it jumps straight out. What causes this? I used PyCharm to debug.
  3. Due to limited equipment resources, feature extraction is performed with the CLIP network without parameter updates (with torch.no_grad()). If the downstream task performance is to be improved, how can we adjust and improve it? Is it possible to add a new function module? I look forward to your reply. I hope you will forgive me for not knowing a lot in my first contact.
vinson2233 commented 2 years ago

@kxkaixin

  1. For both image_features = model.encode_image(img) and text_features = model.encode_text(text), the data dimension is [batch_size, emb_dim]. If I want to get a richer data dimension, how can I get it?

Do you mean you want the embedding to become larger? Since CLIP is already trained and the transformer's last layer has a dimension of 512, I think one of the simplest ways to increase the dimensionality is to add another Dense layer at the end. So instead of using forward from CLIP, just use encode_image and then pass the result to another dense layer with a higher dimension. Do the same with encode_text as well (disclaimer: I never tested this approach; see the rough sketch after this reply).

  2. Not sure, I never use a debugger.

  3. I think you want to use the embeddings from the CLIP model for downstream tasks, right? I believe the architecture will look like this: Data => CLIP => YourModel => Output. You can try to optimize YourModel without modifying CLIP. But if you must adjust CLIP as well, then I suggest you create a new class that holds both CLIP and the model used for the downstream task, and then optimize that new class so that both the downstream model and the CLIP model are optimized. Will you use one embedding or both embeddings? If you only use one (for example, images only), then training only the visual part of CLIP will make the text and image embeddings misaligned. If you want to make sure that CLIP is also fine-tuned to create good image-text representations for your downstream task, then I suggest you use a multi-task learning approach to optimize both tasks at the same time.
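
A rough sketch of the extra Dense layer idea from point 1 above (untested; the wrapper name and the 1024 output size are made up):

import torch.nn as nn

class CLIPWithHead(nn.Module):
    def __init__(self, clip_model, out_dim=1024):
        super().__init__()
        embed_dim = clip_model.text_projection.shape[1]   # 512 for ViT-B/32, 1024 for RN50
        self.clip_model = clip_model
        self.image_head = nn.Linear(embed_dim, out_dim)
        self.text_head = nn.Linear(embed_dim, out_dim)

    def forward(self, images, texts):
        image_features = self.clip_model.encode_image(images).float()
        text_features = self.clip_model.encode_text(texts).float()
        return self.image_head(image_features), self.text_head(text_features)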

kxkaixin commented 2 years ago

https://github.com/openai/CLIP/issues/83#issuecomment-1134134811 Thank you very much for your reply. At the same time, thank you for your detailed explanation, which benefited me a lot.

In addition, I am very sorry to ask you a question.

For the first question, I don't mean that the value of [batch_size, emb_dim] obtained by model.encode_image(img) should change from [100,512] to [100,1024], but rather whether more multidimensional information can be obtained. For example, can the CLIP model be used to obtain data of the form [batch_size, C, H, W] for the image? (It doesn't have to be exactly that; I'm not sure about the form of data I can get, so I'm using this clunky example.) Because I couldn't jump to the expected location during debugging, I can only get data in the form of [batch_size, emb_dim] at present. That's why I asked you the second question.

sarahESL commented 2 years ago

Question about the CLIP itself really: does anyone know why they assign random labels in each iteration? I mean why random? Is it mentioned in the original paper?

[screenshot of the code that assigns the labels]

vinson2233 commented 2 years ago

@kxkaixin do you mean that you want to know how the input changes through the network? Like how the data changes from [BatchSize,R,G,B] => [BatchSize,XXX,YYY] => ... => [BatchSize,512]. I have never tried to look at them, but since this repo is written in plain PyTorch, I think this Stack Overflow post will be helpful: https://stackoverflow.com/questions/42480111/model-summary-in-pytorch
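
For the related question about getting [batch_size, C, H, W]-style features, a rough sketch using a forward hook is below; it reuses model and img from the earlier snippet, and the choice of model.visual.layer4 assumes the RN50 backbone.

features = {}

def save_output(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

handle = model.visual.layer4.register_forward_hook(save_output("layer4"))

with torch.no_grad():
    image_features = model.encode_image(img)   # fills features["layer4"] as a side effect

print(features["layer4"].shape)   # e.g. [batch_size, C, H, W]
handle.remove()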

vinson2233 commented 2 years ago

@sarahESL No, it's not a random number. cross_entropy accepts labels as integer class indices (not in a one-hot binary format). For example, say I want to build a fruit classifier. I can mark apple as 0, banana as 1, and melon as 2. If I have [apple, apple, melon, banana], then the labels become [0,0,2,1]. Now with CLIP, we provide a paired list of images and texts. For example, if I have 10 pairs, that means I have 10 images and 10 texts. The first image should only be matched with the first text, the second image only with the second text, and so on until the 10th image corresponds to the 10th text. Feeding that data produces a logits matrix with dimension 10x10. The way to read it is: in the first row we have the cosine similarity against 10 columns, but the correct one is the first (index 0). The second row should have the highest similarity with the second column (label 1 for the 2nd column), and so on until the last row, which should match the last column (index 9).
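
As a toy illustration of that labeling (the numbers are made up):

import torch
import torch.nn.functional as F

# pretend logits for 3 image-text pairs: row i should peak at column i
logits = torch.tensor([[5.0, 1.0, 0.0],
                       [0.5, 4.0, 1.0],
                       [0.0, 2.0, 6.0]])
labels = torch.arange(3)                 # [0, 1, 2]: the diagonal pairs are the positives
print(F.cross_entropy(logits, labels))   # small, because the diagonal already dominates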

kxkaixin commented 2 years ago

Ok, thank you for your reply. Much appreciated.

smith-co commented 2 years ago
# TO ADD :
# Gradient Checkpointing
# Filter out bias from weight decay
# Decaying learning rate with cosine schedule
# Half-precision Adam statistics
# Half-precision stochastically rounded text encoder weights were used

Do you plan to update the snippet to address the above todos?


vinson2233 commented 2 years ago

@smith-co : Nope, I don't plan to at the moment.

sanjaygunda13 commented 2 years ago

I want to custom-train the CLIP model. My data has captions and image data in base64. I tried training on my data using COCO but was not able to, as I am getting some CUDA error. Can someone help me out please?

abdullah-jahangir commented 2 years ago

Not really an issue, I just want to share my training code since some people still have some difficulties to write the training code. Just modify the code to suit your usage. Feel free to ask or point out any mistakes in my code.

# Latest Update : 31 May 2022, 09:55 GMT+7

# TO ADD :
# Gradient Checkpointing
# Filter out bias from weight decay
# Decaying learning rate with cosine schedule
# Half-precision Adam statistics
# Half-precision stochastically rounded text encoder weights were used

#BATCH_SIZE must larger than 1

device = "cuda:0" if torch.cuda.is_available() else "cpu" # If using GPU then use mixed precision training.
model, preprocess = clip.load("ViT-B/32",device=device,jit=False) #Must set jit=False for training

class image_title_dataset(Dataset):
    def __init__(self, list_image_path,list_txt):

        self.image_path = list_image_path
        self.title  = clip.tokenize(list_txt) #you can tokenize everything at once in here(slow at the beginning), or tokenize it in the training loop.

    def __len__(self):
        return len(self.list_txt)

    def __getitem__(self, idx):
        image = preprocess(Image.open(self.image_path[idx])) # Image from PIL module
        title = self.title[idx]
        return image,title

# use your own data
list_image_path = ['folder/image1.jpg','folder2/image2.jpg'] 
list_txt = ['description for image1.jpg' , 'description for image2.jpg']
dataset = image_title_dataset(list_image_path,list_txt)
train_dataloader = DataLoader(dataset,batch_size = BATCH_SIZE) #Define your own dataloader

#https://github.com/openai/CLIP/issues/57
def convert_models_to_fp32(model): 
    for p in model.parameters(): 
        p.data = p.data.float() 
        p.grad.data = p.grad.data.float() 

if device == "cpu":
  model.float()
else :
  clip.model.convert_weights(model) # Actually this line is unnecessary since clip by default already on float16

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-5,betas=(0.9,0.98),eps=1e-6,weight_decay=0.2) #Params used from paper, the lr is smaller, more safe for fine tuning to new dataset

# add your own code to track the training progress.
for epoch in range(EPOCH):
  for batch in train_dataloader :
      optimizer.zero_grad()

      images,texts = batch 

      images= images.to(device)
      texts = texts.to(device)

      logits_per_image, logits_per_text = model(images, texts)

      ground_truth = torch.arange(len(images),dtype=torch.long,device=device)

      total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2
      total_loss.backward()
      if device == "cpu":
         optimizer.step()
      else : 
        convert_models_to_fp32(model)
        optimizer.step()
        clip.model.convert_weights(model)

Code to save the model :

torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': total_loss,
        }, f"model_checkpoint/model_10.pt") #just change to your preferred folder/filename

Code to load the saved model :

model, preprocess = clip.load("ViT-B/32",device=device,jit=False) #Must set jit=False for training
checkpoint = torch.load("model_checkpoint/model_10.pt")

# Use these 3 lines if you use default model setting(not training setting) of the clip. For example, if you set context_length to 100 since your string is very long during training, then assign 100 to checkpoint['model_state_dict']["context_length"] 
checkpoint['model_state_dict']["input_resolution"] = model.input_resolution #default is 224
checkpoint['model_state_dict']["context_length"] = model.context_length # default is 77
checkpoint['model_state_dict']["vocab_size"] = model.vocab_size 

model.load_state_dict(checkpoint['model_state_dict'])

Alternative training code :

I am getting the following error when I run the code: AttributeError: 'image_title_dataset' object has no attribute 'list_txt', can you please help with this?

vinson2233 commented 2 years ago

@abdullah-jahangir slight typo in my code, I fixed it. Just change the len definition inside the class.

vinson2233 commented 2 years ago

@sanjaygunda13 I never tried processing Base64; maybe you need to try decoding the Base64 into PIL format first.

abdullah-jahangir commented 2 years ago

@abdullah-jahangir slight typo in my code, I fixed it. Just change the len definition inside the class.

Thank you, however I am now getting a new error: File "train.py", line 32, in __getitem__, image = preprocess(Image.open(self.image_path[idx])) # Image from PIL module. I do not understand this, as the numbers of images and texts are equal.

gongshaojie12 commented 2 years ago

Hi, thank you for your work. I have a question: how can I fine-tune CLIP on my own Chinese dataset? Thanks!

antu-saha commented 2 years ago

Hi, thank you very much for the great work. I am trying to use your code for my data. Could you please give me the Dataset and DataLoader classes? Or can you tell me what they should look like and what they will do?

sanjaygunda13 commented 2 years ago

Hello @vinson2233, can you help me out with how to fine-tune the CLIP ViT-B/32 model? I have been struggling for a long time to understand this. Thanks.

vgthengane commented 2 years ago

How can I freeze the CLIP model weights, so that when I use CLIP as a teacher model in knowledge distillation, the CLIP model weights do not change?

Asthestarsfalll commented 2 years ago

@vgthengane Maybe you can use the eval method like this:

model, preprocess = clip.load('RN50')
model.eval()
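
As a rough sketch (not from this thread): eval() only switches layers such as dropout to inference behavior; to keep the teacher's weights fixed during distillation you can also turn off their gradients. The RN50 choice and the dummy batch below are just placeholders.

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher, preprocess = clip.load("RN50", device=device)
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False          # no gradients flow into the teacher

# during distillation, compute the teacher's targets under no_grad as well
images = torch.randn(4, 3, 224, 224, device=device)          # dummy image batch
texts = clip.tokenize(["a photo of a cat"] * 4).to(device)    # dummy captions
with torch.no_grad():
    teacher_image_features = teacher.encode_image(images)
    teacher_text_features = teacher.encode_text(texts)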