CLIP Training Code - Githubissues

vinson2233 commented 3 years ago

Not really an issue, I just want to share my training code since some people still have difficulty writing the training code. Just modify the code to suit your usage. Feel free to ask questions or point out any mistakes in my code.

# Latest Update : 18 July 2022, 09:55 GMT+7

# TO ADD :
# Gradient Checkpointing
# Filter out bias from weight decay
# Decaying learning rate with cosine schedule
# Half-precision Adam statistics
# Half-precision stochastically rounded text encoder weights were used

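# BATCH_SIZE must be larger than 1

import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import clip

BATCH_SIZE = 2 # example value only, set your own (must be larger than 1)
EPOCH = 10 # example value only, set your own

device = "cuda:0" if torch.cuda.is_available() else "cpu" # If using GPU then use mixed precision training.
model, preprocess = clip.load("ViT-B/32",device=device,jit=False) # Must set jit=False for training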

class image_title_dataset(Dataset):
    def __init__(self, list_image_path,list_txt):

        self.image_path = list_image_path
        self.title  = clip.tokenize(list_txt) #you can tokenize everything at once in here(slow at the beginning), or tokenize it in the training loop.

    def __len__(self):
        return len(self.title)

    def __getitem__(self, idx):
        image = preprocess(Image.open(self.image_path[idx])) # Image from PIL module
        title = self.title[idx]
        return image,title

# use your own data
list_image_path = ['folder/image1.jpg','folder2/image2.jpg'] 
list_txt = ['description for image1.jpg' , 'description for image2.jpg']
dataset = image_title_dataset(list_image_path,list_txt)
train_dataloader = DataLoader(dataset,batch_size = BATCH_SIZE) #Define your own dataloader

#https://github.com/openai/CLIP/issues/57
def convert_models_to_fp32(model):
    for p in model.parameters():
        p.data = p.data.float()
        if p.grad is not None: # frozen or unused parameters have no gradient
            p.grad.data = p.grad.data.float()

if device == "cpu":
  model.float()
else :
  clip.model.convert_weights(model) # Actually this line is unnecessary since CLIP is already in float16 by default on GPU

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-5,betas=(0.9,0.98),eps=1e-6,weight_decay=0.2) # Params from the paper; the lr is smaller, which is safer for fine-tuning on a new dataset

# add your own code to track the training progress.
for epoch in range(EPOCH):
  for batch in train_dataloader :
      optimizer.zero_grad()

      images,texts = batch 

      images= images.to(device)
      texts = texts.to(device)

      logits_per_image, logits_per_text = model(images, texts)

      ground_truth = torch.arange(len(images),dtype=torch.long,device=device)

      total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2
      total_loss.backward()
      if device == "cpu":
         optimizer.step()
      else : 
        convert_models_to_fp32(model)
        optimizer.step()
        clip.model.convert_weights(model)

NOTE :
For inference, the conversion step from fp16 to fp32 is not needed; just use the model in full fp16.
For multi-GPU training, see my comment on https://github.com/openai/CLIP/issues/111#issuecomment-854320770
I'm not the author of this model, nor do I have any relationship with the authors. I'm just a random guy who is interested in CLIP.
For training image-image or text-text, please refer to this principle: https://github.com/openai/CLIP/issues/83#issuecomment-1487820198
What is the difference between the image loss and the text loss? Isn't one just a transposed version of the other? Read this: https://github.com/openai/CLIP/issues/83#issuecomment-1489603702
Why is the ground truth torch.arange? https://github.com/openai/CLIP/issues/83#issuecomment-1141139017

Code to save the model :

torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': total_loss,
        }, "model_checkpoint/model_10.pt") # just change to your preferred folder/filename

Code to load the saved model :

model, preprocess = clip.load("ViT-B/32",device=device,jit=False) #Must set jit=False for training
checkpoint = torch.load("model_checkpoint/model_10.pt")

# Use these 3 lines if you use default model setting(not training setting) of the clip. For example, if you set context_length to 100 since your string is very long during training, then assign 100 to checkpoint['model_state_dict']["context_length"] 
checkpoint['model_state_dict']["input_resolution"] = model.input_resolution #default is 224
checkpoint['model_state_dict']["context_length"] = model.context_length # default is 77
checkpoint['model_state_dict']["vocab_size"] = model.vocab_size 

model.load_state_dict(checkpoint['model_state_dict'])

Alternative training code :

@zasder3 has created a PyTorch Lightning version for training CLIP: https://github.com/Zasder3/train-CLIP
@mitchellnw and researchers at UW, Google, Stanford, Amazon, Columbia, and Berkeley have also created their own training code: https://github.com/mlfoundations/open_clip

nikky4D commented 3 years ago

Very helpful. Thank you

vkmavani commented 3 years ago

Not really an issue, I just want to share my training code since some people still have some difficulties to write the training code Feel free to ask or point out any mistakes in my code.

train_dataloader = DataLoader(...,batch_size = BATCH_SIZE) #Define your own dataloader

#https://github.com/openai/CLIP/issues/57
def convert_models_to_fp32(model): 
    for p in model.parameters(): 
        p.data = p.data.float() 
        p.grad.data = p.grad.data.float() 

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32",device=device,jit=False) #Must set jit=False for training
clip.model.convert_weights(model)

loss_img = nn.CrossEntropyLoss()
loss_txt = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-5,betas=(0.9,0.98),eps=1e-6,weight_decay=0.2) #Params from paper

for batch in train_dataloader :
    optimizer.zero_grad()

    list_image,list_txt = batch #list_images is list of image in numpy array(np.uint8)

    images= torch.stack([preprocess(Image.fromarray(img)) for img in list_image],dim=0)
    texts = clip.tokenize(list_txt)

    logits_per_image, logits_per_text = model(images, texts)

    ground_truth = torch.arange(BATCH_SIZE).to(device)
    total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2
    total_loss.backward()

    convert_models_to_fp32(model)
    optimizer.step()
    clip.model.convert_weights(model)

Hi, thank you for this training code. I have a dataset on which I want to check image similarity, and I want to use CLIP. But I don't know how to prepare a dataset (image_size, embedding_size, transforms, etc.) to feed this training code. Could you please provide the dataset class if possible?

vinson2233 commented 3 years ago

@vkmavani sure. The preprocess object from CLIP takes care of all of the preprocessing steps for the image part, so you don't need to worry about image_size or transform(see https://github.com/openai/CLIP/blob/main/clip/clip.py line 58). For example, maybe your data look like this :

| image | caption  |
|-------|----------|
| url1  | caption1 |
| url2  | caption2 |

where the URL is the path to the image and the caption is the string of the caption.

Here's the dataset class definition for image-text similarity :


from PIL import Image
from torch.utils.data import Dataset, DataLoader

class image_caption_dataset(Dataset):
    def __init__(self, df):

        self.images = df["image"].tolist()
        self.caption = df["caption"].tolist()

    def __len__(self):
        return len(self.caption)

    def __getitem__(self, idx):

        images = preprocess(Image.open(self.images[idx])) #preprocess from clip.load
        caption = self.caption[idx]
        return images,caption

dataset = image_caption_dataset(df)
train_dataloader = DataLoader(dataset,batch_size = BATCH_SIZE) #Define your own dataloader

With this dataset definition, you can omit the Image.fromarray() and the preprocess step after loading the batch, since the image data is already in tensor format.

If you are interested in image-image similarity, modify the dataset to return a pair of images and adjust the training code accordingly. The biggest change is in how the logits are created: replace the forward call logits_per_image, logits_per_text = model(images, texts) with your own computation, following the forward method at https://github.com/openai/CLIP/blob/main/clip/model.py, line 354.
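
As a rough illustration of the image-image case, the sketch below builds logits from two batches of embeddings with the same normalise-then-scale pattern that CLIP's forward method uses. The function name `pairwise_logits`, the random stand-in features, and the scale value of 100.0 are assumptions made for this example; in real training the features would come from model.encode_image applied to each image of the pair and the scale from model.logit_scale.exp().

```python
import torch

def pairwise_logits(feat_a, feat_b, logit_scale):
    """Scaled cosine-similarity logits between two batches of embeddings."""
    # L2-normalise each embedding so the dot product is a cosine similarity.
    feat_a = feat_a / feat_a.norm(dim=1, keepdim=True)
    feat_b = feat_b / feat_b.norm(dim=1, keepdim=True)
    logits_per_a = logit_scale * feat_a @ feat_b.t()
    logits_per_b = logits_per_a.t()
    return logits_per_a, logits_per_b

torch.manual_seed(0)
feat_a = torch.randn(4, 512)  # stand-in for model.encode_image(images_a)
feat_b = torch.randn(4, 512)  # stand-in for model.encode_image(images_b)
logits_per_a, logits_per_b = pairwise_logits(feat_a, feat_b, logit_scale=100.0)
print(logits_per_a.shape)
```

The two returned matrices can then be fed to the same pair of cross-entropy losses as logits_per_image and logits_per_text in the training loop.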

lonngxiang commented 3 years ago

What does clip.model.convert_weights mean? And can you provide complete training code if possible?

vinson2233 commented 3 years ago

@lonngxiang For more information, read https://github.com/openai/CLIP/issues/57. clip.model.convert_weights basically converts the CLIP model weights to float16. This helps accelerate training and reduce memory usage. The definition of clip.model.convert_weights can be found at https://github.com/openai/CLIP/blob/main/clip/model.py line 371

I can't give a fully working example code since I'm using a private dataset, but I believe the training code and dataset code that I provided is sufficient.

lonngxiang commented 3 years ago

Thank you for your kind reply

lonngxiang commented 3 years ago

There is an error when running this training code: TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'PIL.JpegImagePlugin.JpegImageFile'>

vkmavani commented 3 years ago

@vkmavani sure. The preprocess object from CLIP takes care of all of the preprocessing steps for the image part, so you don't need to worry about image_size or transform(see https://github.com/openai/CLIP/blob/main/clip/clip.py line 58). For example, maybe your data look like this :
| image | caption  |
|-------|----------|
| url1  | caption1 |
| url2  | caption2 |
where the URL is the path to the image and the caption is the string of the caption.

Here's the dataset class definition for image-text similarity :
from PIL import Image

class image_caption_dataset(Dataset):
    def __init__(self, df):

        self.images = df["image"].tolist()
        self.caption = df["caption"].tolist()

    def __len__(self):
        return len(self.caption)

    def __getitem__(self, idx):

        images = Image.open(self.images[idx])
        caption = self.caption[idx]
        return images,caption

dataset = image_caption_dataset(df)
train_dataloader = DataLoader(dataset,batch_size = BATCH_SIZE) #Define your own dataloader
With this dataset definition, you can omit the Image.fromarray() since the actual data already in PIL format.

If you are interested in doing image-image similarity, just modify the dataset to return pair of images and for the training code, adjust the code accordingly, a big change will happen in the creating the logits part. Change the forward method logits_per_image, logits_per_text = model(images, texts) according to https://github.com/openai/CLIP/blob/main/clip/model.py, line 354.

Thank you very much. It really helps a lot.

vinson2233 commented 3 years ago

@lonngxiang Oh, you are correct. Pardon me, I have edited my code above. The dataset should return something that can be put into a PyTorch tensor.

lonngxiang commented 3 years ago

One more thing: when preprocess is used inside class image_caption_dataset, is the preprocess call inside torch.stack still needed?

lonngxiang commented 3 years ago

I still get an error in images= torch.stack([preprocess(Image.fromarray(img)) for img in list_image],dim=0):

AttributeError: 'Tensor' object has no attribute '__array_interface__'

vinson2233 commented 3 years ago

Yes, if preprocess is already used inside the class, the result from the batch can be passed directly to CLIP. So that line can be changed to: images = list_image

lonngxiang commented 3 years ago

Then there is another error: RuntimeError: "unfolded2d_copy" not implemented for 'Half'

vinson2233 commented 3 years ago

Hmmmm, that error is new to me. Does the error occur when calculating the loss?

lonngxiang commented 3 years ago

Yes, the error occurred on this line: logits_per_image, logits_per_text = model(images, texts)

Changing it to model(images.float(), texts.float()) still gives the error: RuntimeError: "unfolded2d_copy" not implemented for 'Half'

vinson2233 commented 3 years ago

Are you using CPU by any chance? Mixed-precision training usually doesn't work on CPU.

lonngxiang commented 3 years ago

Yes, I am running it on CPU.

vinson2233 commented 3 years ago

@lonngxiang I have updated the code again. Basically, remove all code related to mixed-precision training when using CPU instead of GPU

lonngxiang commented 3 years ago

OK. That is very kind of you; thank you for your patience.

lonngxiang commented 3 years ago

I ran it on CPU and there is still a problem: the total_loss is always 0.

lonngxiang commented 3 years ago

How should BATCH_SIZE be set to get the labels for ground_truth?

vinson2233 commented 3 years ago

@lonngxiang Hmmmm, I don't have the faintest idea why the loss is = 0.

BATCH_SIZE is just an integer that you set. Since the images and texts come in pairs, the first image corresponds to the first text, so the ground truth for the first image is 0; the second image corresponds to the second text, so its ground truth is 1. This pattern repeats until the last image-text pair, so the ground truth is a torch tensor like this: torch.tensor([0,1,2,3,...,BATCH_SIZE-1]). Since the pre-trained CLIP used a massive batch size, try to use the largest BATCH_SIZE your system can handle.

You can read more info about cross-entropy loss https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html, especially about the target. Also the CLIP paper, page 5, the upper left part.
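
To make the role of the torch.arange targets concrete, here is a minimal pure-Python sketch of the symmetric loss on a made-up 3x3 similarity matrix. The `cross_entropy` helper and the logit values are assumptions for illustration only; the helper follows the usual softmax-then-negative-log definition rather than calling PyTorch.

```python
import math

def cross_entropy(logits, targets):
    """Mean cross-entropy over rows: softmax per row, then -log p[target]."""
    total = 0.0
    for row, target in zip(logits, targets):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[target]
    return total / len(logits)

# Hypothetical similarity matrix: rows are images, columns are texts.
logits_per_image = [
    [5.0, 1.0, 0.0],
    [4.0, 6.0, 1.0],
    [0.0, 2.0, 7.0],
]
logits_per_text = [list(col) for col in zip(*logits_per_image)]  # transpose

# Pair i sits on the diagonal, so the target for row i is i (like torch.arange).
ground_truth = list(range(len(logits_per_image)))
loss_img = cross_entropy(logits_per_image, ground_truth)
loss_txt = cross_entropy(logits_per_text, ground_truth)
total_loss = (loss_img + loss_txt) / 2
print(loss_img, loss_txt, total_loss)
```

Although the second matrix is just the transpose of the first, the softmax is taken over different sets of candidates (texts for a given image versus images for a given text), so the two loss terms generally differ.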

lonngxiang commented 3 years ago

Thanks for your reply. So if you have five pairs, your BATCH_SIZE is five, is that right?

vinson2233 commented 3 years ago

Your BATCH_SIZE determines the number of pairs in each batch.

For example, If you have 1000 pairs, and set BATCH_SIZE = 20. Then for each loop of for batch in train_dataloader, the variable batch will give you 20 pairs. The loop will be repeated 50 times to cover all the data for 1 epoch.
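
The arithmetic above can be checked with a small helper. The name `batches_per_epoch` is made up for this sketch; the `drop_last` flag mirrors the DataLoader option of the same name, which defaults to False.

```python
import math

def batches_per_epoch(num_pairs, batch_size, drop_last=False):
    """Number of iterations of `for batch in train_dataloader` in one epoch."""
    if drop_last:
        return num_pairs // batch_size  # an incomplete final batch is skipped
    return math.ceil(num_pairs / batch_size)  # the final batch may be smaller

print(batches_per_epoch(1000, 20))
```

When the dataset size is not a multiple of BATCH_SIZE, the final batch is smaller, which is one reason to build ground_truth from len(images) rather than from BATCH_SIZE.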

lonngxiang commented 3 years ago

Yes, but when I set BATCH_SIZE = 1, the total_loss is always 0. Is this right? What is wrong with it?

vinson2233 commented 3 years ago

Yes, that's the problem. BATCH_SIZE must be greater than 1. The reason is that your prediction returns the cosine similarity for that image and that text. CrossEntropyLoss is a combination of softmax and log loss. Since each row has only 1 prediction (because BATCH_SIZE=1), the softmax returns probability=1 for that entry (it doesn't matter whether the logit is high or low), which automatically corresponds to the correct ground truth, so the loss is 0.
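
A quick way to see this is to apply a softmax to a single logit. The `softmax` helper below is written out by hand for illustration and is not taken from the training code.

```python
import math

def softmax(row):
    """Softmax of one row of logits."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

results = []
for logit in (-30.0, 0.0, 30.0):
    prob = softmax([logit])[0]   # only one candidate, as with BATCH_SIZE = 1
    loss = -math.log(prob)       # cross-entropy against target index 0
    results.append((logit, prob, loss))
print(results)
```

Whatever the logit is, the single candidate receives all of the probability mass, so there is no loss and nothing for the gradient to correct.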

lonngxiang commented 3 years ago

Thank you for all the help; I learned a lot.

dmoham1476 commented 3 years ago

1. Don't we need to do clip.load_state_dict after clip.load?
2. Aren't we supposed to call model.encode_image and model.encode_text and then normalize before training?
3. Can you please add demo code for early stopping, saving the model (.pt), and metrics as well?
4. Are we fine-tuning only the ViT and not the text part? How did this impact performance on a custom dataset?

vinson2233 commented 3 years ago

@dmoham1476

1. No. See https://github.com/openai/CLIP/blob/main/clip/clip.py line 114. The weights are already loaded when we call clip.load. Only use torch's load_state_dict to continue training from a checkpoint.
2. Yes, that all happens inside the forward function. See https://github.com/openai/CLIP/blob/main/clip/model.py line 354. If you want to train on one-to-one image-text pairs, the forward method already takes care of encode_image, encode_text, and the normalization.


3. Early stopping and saving the model:

EARLYSTOP_PATIENCE = 10 # Define your own number
best_loss = float("inf")
best_iter = 0
for epoch in range(EPOCH):
    for batch in train_dataloader :
        # <do training: forward pass, total_loss, backward>
        if device == "cpu":
            optimizer.step()
        else :
            convert_models_to_fp32(model)
            optimizer.step()
            clip.model.convert_weights(model)

    # EVALUATION ON VALIDATION DATASET
    with torch.no_grad():
        for batch in validation_dataloader :
            # <do forward prop on the validation batch>
            pass
    # val_loss = <calculate loss over the validation batches>

    if val_loss < best_loss :
        best_iter = epoch+1
        best_loss = val_loss

    # saved every epoch
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': total_loss,
        }, "save_dir") # change to your preferred file path

    if ((epoch+1)-best_iter)>EARLYSTOP_PATIENCE:
        print("Early stop achieved at", epoch+1)
        break

4. After loading CLIP, try printing the model. It will show a long list of layers. You can access the components like this: `model.transformer`, `model.visual.transformer`. The text part uses only a transformer, while the visual part also uses a transformer (that's model.visual.transformer). Loading CLIP lets you train all the parts by default. You can freeze some components, for example like this:

for k in model.visual.transformer.parameters():
    k.requires_grad=False


This code will freeze the transformer of the visual part (to freeze the whole visual encoder, loop over model.visual.parameters() instead).
I encourage you to see the components of CLIP
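
Below is a small sketch of what freezing does in practice, using a tiny stand-in module instead of the real CLIP model. The class `TinyTwoTower` and its layer sizes are invented for this example; only the requires_grad pattern carries over.

```python
import torch
from torch import nn, optim

# Tiny hypothetical stand-in with a "visual" and a "text" branch; not the real CLIP model.
class TinyTwoTower(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual = nn.Linear(8, 4)
        self.text = nn.Linear(6, 4)

model = TinyTwoTower()

# Freeze one branch, in the same way as the loop over model.visual.transformer.parameters().
for p in model.visual.parameters():
    p.requires_grad = False

# Give the optimizer only the parameters that are still trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable, lr=5e-5)

# After a backward pass, frozen parameters have no gradient at all.
loss = (model.visual(torch.randn(2, 8)) * model.text(torch.randn(2, 6))).sum()
loss.backward()

frozen_names = [n for n, p in model.named_parameters() if not p.requires_grad]
print(frozen_names, model.visual.weight.grad)
```

Because the gradient of a frozen parameter stays None, any code that touches p.grad for every parameter (such as a helper that casts gradients to fp32) needs to skip parameters whose gradient is None.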

uplusv commented 3 years ago

Hi, Vinson! Thank you for your code, it helps me a lot! But I ran into a problem when fine-tuning CLIP on my own data with your code. The task is a 6-class classification problem, so I set batch_size=6. After fine-tuning, the model outputs sample feature for every image. Is it a problem of the small batch size, the fixed order of the 6 classes, or perhaps something else?

vinson2233 commented 3 years ago

@uplusv If you want to modify CLIP into a classifier (single label, multi-class), here are some modifications you can make:

Change ground_truth = torch.arange(BATCH_SIZE).to(device) to an integer vector that specifies which class each image belongs to (for example torch.tensor([0,1,2,1,2,3,4,5])). With this, you can set the batch size to an arbitrary size.
One image should match 1 label, but 1 label can match multiple images. So you can drop loss_txt, changing total_loss = (loss_img(logits_per_image,ground_truth) + loss_txt(logits_per_text,ground_truth))/2 to total_loss = loss_img(logits_per_image,ground_truth)
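
The two modifications above might look like the following sketch. It assumes the text batch holds one prompt per class, so that logits_per_image has one column per class; the random logits are only a stand-in for the model output, and `num_classes` is a name chosen for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)
num_classes = 6
# Class index of each image in the batch (the example vector from the text).
ground_truth = torch.tensor([0, 1, 2, 1, 2, 3, 4, 5])

# Stand-in for logits_per_image: one row per image, one column per class prompt.
logits_per_image = torch.randn(len(ground_truth), num_classes)

loss_img = nn.CrossEntropyLoss()
total_loss = loss_img(logits_per_image, ground_truth)  # loss_txt is omitted
predicted = logits_per_image.argmax(dim=1)             # predicted class per image
print(total_loss.item(), predicted.tolist())
```

Note that the batch size (8 here) no longer has to equal the number of classes, since the targets are class indices rather than positions in the batch.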

I'm not sure what you meant by "After fine-tuning, the model outputs sample feature for every image"

uplusv commented 3 years ago

Thank you for your reply and advice, I will try it soon! By "After fine-tuning, the model outputs sample feature for every image", I mean that after running "image_features = model.encode_image(image_input)" and printing "image_features", I get:

image_features: tensor([
[ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039],
[ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039],
[ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039],
...,
[ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039],
[ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039],
[ 0.0098, 0.0047, 0.0057, ..., 0.0018, 0.0056, -0.0039]])

while the original model outputs:

image_features: tensor([
[ 0.0304, -0.0169, -0.0383, ..., 0.0927, 0.0261, 0.0203],
[ 0.0013, -0.0067, -0.0524, ..., 0.1029, 0.0028, 0.0169],
[ 0.0115, -0.0006, -0.0392, ..., 0.0616, 0.0317, 0.0171],
...,
[ 0.0173, -0.0152, -0.0431, ..., 0.0836, 0.0405, 0.0268],
[ 0.0287, -0.0236, -0.0401, ..., 0.0856, 0.0119, 0.0287],
[ 0.0150, 0.0013, -0.0537, ..., 0.0792, 0.0104, 0.0062]])

After fine-tuning, the features become identical and smaller, so I get identical and large logits (like 99.8856) for every image😢.

vinson2233 commented 3 years ago

Hmmm, I don't know what caused the model to produce the same value. Maybe something broke inside your data loader. Whatever the cause is, I hope you can find your solution.

Zasder3 commented 3 years ago

For those looking here in the future, I've made use of @vinson2233's code to create an easy-to-use PyTorch Lightning repo for training your own CLIP model from scratch.

ChawDoe commented 3 years ago

Hey, I tried to train it from scratch, but I found that the model is hard to train: the loss plateaus after some iterations. Have you run into the same problem?

vinson2233 commented 3 years ago

@Zasder3 Awesome, thanks for the effort 👍. It would be a blast if we could recreate every configuration from the paper, since my code still lacks several features.

@ChawDoe For me, the training went quite smoothly. I use a batch size of 512 (with 4 GPUs), 1 million data pairs, and gradient accumulation over 8 steps. The first several steps give a loss of around 2; at epoch 20 my average training loss is 0.14.

ChawDoe commented 3 years ago

@vinson2233 Do you use fp16 training here? I think the problem may be due to my fp16 training and I set lr to 5e-5, which may lead to invalid gradients.

vinson2233 commented 3 years ago

@ChawDoe I'm using fp16 for the forward pass and gradient calculation (backward), and fp32 for the parameter update (step), just like the code I posted. Using full fp32 gives slower training and a lower maximum batch size, while using full fp16 gives NaN gradients because of underflow. The gradients might differ slightly between fp16 and fp32, but that shouldn't affect training to the point of stepping in the wrong direction.
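
As one illustration of the kind of precision loss involved, the sketch below rounds values to half precision using only the standard library. The helper `to_fp16` and the example magnitudes are assumptions chosen for the demonstration; they are not measurements from CLIP training, and Python's own floats are 64-bit rather than fp32.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

tiny_gradient = 1e-8
print(to_fp16(tiny_gradient))      # too small for fp16, so it underflows

weight, update = 1.0, 1e-4
print(to_fp16(weight + update))    # the small update vanishes when stored as fp16
print(weight + update)             # it survives at higher precision
```

Small updates that disappear when the weights are stored in fp16 are a common motivation for doing the optimizer step on fp32 copies and casting back afterwards.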

lonngxiang commented 3 years ago

I used this code for training and saved .pt or .pkl files, but how do I load and reuse them later?

vinson2233 commented 3 years ago

@lonngxiang I have updated the code for saving and loading. Basically, to load the model, use this code:

model, preprocess = clip.load("ViT-B/32",device=device,jit=False) 
checkpoint = torch.load("model_checkpoint/model_10.pt")

# Use these 3 lines if you use default model setting(not training setting) of the clip. For example, if you set context_length to 100 since your string is very long during training, then assign 100 to checkpoint['model_state_dict']["context_length"] 
checkpoint['model_state_dict']["input_resolution"] = model.input_resolution 
checkpoint['model_state_dict']["context_length"] = model.context_length
checkpoint['model_state_dict']["vocab_size"] = model.vocab_size 

model.load_state_dict(checkpoint['model_state_dict'])

Just modify the dict key to match your dict key when saving to .pt file

lonngxiang commented 3 years ago

I see, let me try. Thanks.

vinson2233 commented 3 years ago

@lonngxiang Actually, you don't need to copy the entire message to reply to a specific message, especially a long one. You can use the copy-link function at the top right of the message to produce a URL that points to the specific message, like this: https://github.com/openai/CLIP/issues/83#issue-853114174. Just to make things shorter.

For your question: yes, k is epoch and loss is total_loss. I just copy-pasted from my actual code and forgot to change the variables; I will fix that right away.

lonngxiang commented 3 years ago

@vinson2233 Thanks (https://github.com/openai/CLIP/issues/83#issue-853114174). Is k equal to EPOCH or epoch? I see you wrote epoch now.

vinson2233 commented 3 years ago

@lonngxiang I save my model every epoch, so I use the epoch variable. Also, once training has completed, epoch will equal EPOCH-1. You can also change it to torch.save({'epoch': epoch+1,...}) so that the saved values start from 1 and the final save has an epoch key equal to EPOCH. Note that this epoch key does not affect the model's behavior after loading; it only stores metadata about the model. The same goes for total_loss.

lonngxiang commented 3 years ago

How well does fine-tuning work for you? In my case it seems to degrade results that were previously fine, and the outcome is not good.

vinson2233 commented 3 years ago

Some scenarios for saving the model:

Train and inference: the only important thing to save is the model state.
Train, pause and resume: save the model, epoch, and optimizer state. Your epoch counter in the loop will continue from the last epoch. You also need to load the optimizer state from the previous training (Adam needs information about its running gradient statistics); you can use the load_state_dict method on the optimizer to load checkpoint['optimizer_state_dict'].
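
The pause-and-resume scenario could be sketched as follows, with a small linear layer standing in for the CLIP model. The temporary file path, the layer sizes, and the saved epoch value are placeholders chosen for this example.

```python
import os
import tempfile
import torch
from torch import nn, optim

# Hypothetical stand-in model; with CLIP this would be the model from clip.load.
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=5e-5)

# Take one step so the optimizer has running statistics worth saving.
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save({
    "epoch": 4,  # last finished epoch (0-based), example value
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, path)

# Resume: rebuild the model and optimizer, then restore both states.
resumed_model = nn.Linear(4, 2)
resumed_optimizer = optim.Adam(resumed_model.parameters(), lr=5e-5)
checkpoint = torch.load(path)
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1  # continue the epoch counter

print(start_epoch)
```

The training loop would then run over range(start_epoch, EPOCH) instead of range(EPOCH).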

lonngxiang commented 3 years ago

How do I load them? Like this? model.load_state_dict(checkpoint['model_state_dict']) optimizer.load_state_dict(checkpoint['optimizer_state_dict'])

vinson2233 commented 3 years ago

@lonngxiang That should be correct. Note that I have never tried loading the optimizer, since I have never paused training.

lonngxiang commented 3 years ago

Thanks, let me try it.

vinson2233 commented 3 years ago

What is it that doesn't work? Does it raise any error? I set jit=False when loading the model with clip.load.

openai / CLIP

CLIP Training Code #83