lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch

Loss NaN when training CLIP #310

Closed huutuongtu closed 10 months ago

huutuongtu commented 1 year ago

Hello, I am trying the sample code you provided for training CLIP, and the loss decreases quickly and then jumps to NaN. Also, as you can see, the CLIP loss goes negative. Is this normal?

import torch
from dalle2_pytorch import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 1,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 1,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
    use_all_token_embeds = True,            # whether to use fine-grained contrastive learning (FILIP)
    decoupled_contrastive_learning = True,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
    extra_latent_projection = True,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)
    use_visual_ssl = True,                  # whether to do self supervised learning on images
    visual_ssl_type = 'simclr',             # can be either 'simclr' or 'simsiam', depending on using DeCLIP or SLIP
    use_mlm = False,                        # use masked language learning (MLM) on text (DeCLIP)
    text_ssl_loss_weight = 0.05,            # weight for text MLM loss
    image_ssl_loss_weight = 0.05            # weight for image self-supervised learning loss
).cuda()

# mock data

text = torch.randint(0, 49408, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

import torch.optim as optim

optimizer = optim.AdamW(clip.parameters(), lr = 3e-3)

# train on the mock batch for 1000 steps
for _ in range(1000):
    loss = clip(text, images, return_loss = True)
    loss.backward()
    print(loss)
    optimizer.step()
    optimizer.zero_grad()

Loss:

tensor(23.7823, device='cuda:0', grad_fn=<AddBackward0>)
tensor(31.7683, device='cuda:0', grad_fn=<AddBackward0>)
tensor(17.6100, device='cuda:0', grad_fn=<AddBackward0>)
tensor(1.6753, device='cuda:0', grad_fn=<AddBackward0>)
tensor(0.3173, device='cuda:0', grad_fn=<AddBackward0>)
tensor(0.0118, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-0.2509, device='cuda:0', grad_fn=<AddBackward0>)
tensor(0.0360, device='cuda:0', grad_fn=<AddBackward0>)
tensor(0.1517, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-0.0780, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-0.4277, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-0.4603, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-0.1004, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-0.4624, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-0.7675, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-0.8480, device='cuda:0', grad_fn=<AddBackward0>)
...
...
tensor(-96.9268, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-96.9080, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-98.5298, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-98.6230, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-99.4977, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-100.2258, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-101.0956, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-102.0886, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-102.0255, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-104.2353, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-103.7044, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-105.2783, device='cuda:0', grad_fn=<AddBackward0>)
tensor(-inf, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
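
A negative loss by itself is not necessarily a bug here: with decoupled_contrastive_learning = True the positive pairs are removed from the denominator of the InfoNCE loss (see the config comment above), so the objective is no longer bounded below by zero. The snippet below is only a minimal numerical sketch with made-up similarity scores, not the library's implementation, but it shows why the decoupled variant can go negative while standard InfoNCE cannot:

import torch

# illustrative similarity logits for one query; index 0 is the positive pair
sim = torch.tensor([[10.0, -5.0, -5.0, -5.0]])

# standard InfoNCE: the positive is part of the denominator, so the loss is >= 0
infonce = -(sim[:, 0] - torch.logsumexp(sim, dim = -1))

# decoupled (DCL) variant: the positive is removed from the denominator, so the loss can be < 0
dcl = -(sim[:, 0] - torch.logsumexp(sim[:, 1:], dim = -1))

print(infonce.item(), dcl.item())  # ~0.0 vs. a clearly negative value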
pcehnago commented 10 months ago

Did you solve it?

huutuongtu commented 10 months ago

No, I gave up on training it, but you should check this: https://github.com/lucidrains/x-clip/issues/9#issuecomment-1137505140
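
Independent of whatever the linked comment suggests, two generic changes often keep a contrastive run like the one above from drifting to -inf/NaN: a lower learning rate (3e-3 is aggressive for AdamW here) and gradient clipping. Below is a hedged sketch of the same loop with both, reusing clip, text and images from the issue above (1e-4 and max_norm = 1.0 are illustrative values, not recommendations from this repo):

import torch
import torch.optim as optim

optimizer = optim.AdamW(clip.parameters(), lr = 1e-4)   # illustrative, lower than the 3e-3 above

for _ in range(1000):
    loss = clip(text, images, return_loss = True)
    loss.backward()

    # bound the size of any single update before stepping
    torch.nn.utils.clip_grad_norm_(clip.parameters(), max_norm = 1.0)

    optimizer.step()
    optimizer.zero_grad()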

pcehnago commented 10 months ago

Thank you!

hariouat commented 6 months ago

Hello, did you solve the problem? I have the same issue.

pcehnago commented 6 months ago

Yes, I changed the input data and it works, but I still don't know the specific cause of the NaN loss.
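
Since the reported fix was changing the input data, a quick sanity check on the batch before training can rule out the most common data-side causes of a NaN loss. The helper below is just an illustrative sketch (not part of this repo); it assumes the text and images tensors and the num_text_tokens = 49408 setting from the original issue:

import torch

def check_batch(text, images, num_text_tokens = 49408):
    # token ids must be valid indices into the text embedding table
    assert text.dtype == torch.long, 'text must contain integer token ids'
    assert text.min() >= 0 and text.max() < num_text_tokens, 'token id out of range'
    # images must not contain NaN/Inf values
    assert torch.isfinite(images).all(), 'non-finite value in image batch'

check_batch(text, images)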