encounter nan error in finding neg idx by torch.multinomial

salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method

BSD 3-Clause "New" or "Revised" License


encounter nan error in finding neg idx by torch.multinomial #107

Open AHEADer opened 2 years ago

AHEADer commented 2 years ago

When I use models/model_pretrain.py to train on my own dataset, the following error is raised: RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)

I filtered out all NaN values in the torch.multinomial input, but a similar error still occurs, which suggests that every value in the input is NaN. I suspect an overflow is happening here.

My pretrained weights are ImageNet ViT + bert-base-chinese (all other weights are randomly initialized). I train with AMP. I restrict temp to the range [0.1, 1]. I also apply gradient clipping with max_norm=5.0 and norm_type=2.0, and use the AdamW optimizer. I use learning-rate warmup starting from 1e-9; the NaN error occurs when the learning rate reaches about 1e-5.

Can you give me some suggestions on this? Much appreciated!
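For readers following the overflow hypothesis above, here is a minimal sketch of one way a float16 overflow can turn into NaN probabilities. The values are made up for illustration, and the thread does not establish that this is what happens in the reported run.

```python
import torch

# float16 cannot represent values above 65504, so a large logit becomes inf.
# The numbers here are illustrative only, not taken from the reported run.
logits_fp32 = torch.tensor([70000.0, 1.0, 2.0])
logits_fp16 = logits_fp32.to(torch.float16)

# Softmax over a row that contains +inf yields NaN (inf - inf in the max shift).
probs_from_fp16 = torch.softmax(logits_fp16.float(), dim=0)

# The same row kept in float32 throughout stays finite.
probs_fp32 = torch.softmax(logits_fp32, dim=0)

has_inf = bool(torch.isinf(logits_fp16).any())
has_nan = bool(torch.isnan(probs_from_fp16).any())
fp32_ok = bool(torch.isfinite(probs_fp32).all())
print(has_inf, has_nan, fp32_ok)
```

If something like this is the cause, filtering NaN at the torch.multinomial call only treats the symptom, since the non-finite values originate earlier in the forward pass.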

LiJunnan1992 commented 2 years ago

Hi, please refer to this issue for a solution: https://github.com/salesforce/BLIP/issues/76

Also, it is recommended to try out our new library: https://github.com/salesforce/LAVIS for training.

AHEADer commented 2 years ago

Thanks for your answer! I'll give it a try.

AHEADer commented 2 years ago

Hi, I tried what you suggested in this way:

Old code:

for b in range(bs):
    neg_idx = torch.multinomial(weights_t2i[b], 1).item()
    image_embeds_neg.append(image_embeds[neg_idx])

Current code:

for b in range(bs):
    nan_idx = weights_t2i[b].isnan()
    weights_t2i[b][nan_idx] = 0.0001
    neg_idx = torch.multinomial(weights_t2i[b], 1).item()
    image_embeds_neg.append(image_embeds[neg_idx])

With this change, all losses become NaN.

I may need your help with this. Many thanks!
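The NaN-replacement attempt above only handles NaN; inf or negative entries would still break torch.multinomial. A somewhat more defensive sketch of the sampling step follows. `sample_negative` and its `exclude` parameter are hypothetical names, not part of ALBEF, and the sketch assumes each row has at least two entries. It keeps the sampling call from raising, but it cannot repair losses that are already NaN upstream.

```python
import torch

def sample_negative(weights_row: torch.Tensor, exclude: int) -> int:
    """Sample one index from a possibly corrupted weight row.

    Hypothetical helper: `exclude` is the index that must not be picked
    (for example the positive pair). Assumes the row has >= 2 entries.
    """
    # nan_to_num returns a new tensor, so the caller's row is not modified.
    w = torch.nan_to_num(
        weights_row.detach().float(), nan=0.0, posinf=0.0, neginf=0.0
    )
    w = w.clamp_min(0.0)   # multinomial rejects negative weights
    w[exclude] = 0.0
    if w.sum() <= 0:
        # No usable weight is left: fall back to uniform over the other indices.
        w = torch.ones_like(w)
        w[exclude] = 0.0
    return int(torch.multinomial(w, 1).item())

# Usage with an all-NaN row, the situation suspected in this thread.
row = torch.full((4,), float("nan"))
idx = sample_negative(row, exclude=2)
print(idx)
```

Falling back to uniform sampling changes which negatives are chosen, so it is best treated as a way to keep a run alive while the source of the non-finite values is tracked down.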

LiJunnan1992 commented 2 years ago

It is hard for me to diagnose what is happening. Could you provide more information on the batch that leads to NaN loss?
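One way to gather that information without sharing the data itself is to log which intermediate tensors first contain non-finite values. Below is a minimal sketch; `report_nonfinite` is a hypothetical helper, and which tensors to pass in (for example image_embeds or weights_t2i from the snippets above) is up to the person debugging.

```python
import torch

def report_nonfinite(named_tensors):
    """Return {name: {"nan": count, "inf": count}} for tensors with bad values.

    Hypothetical debugging helper; tensors that are fully finite are omitted.
    """
    report = {}
    for name, t in named_tensors.items():
        t = t.detach().float()
        n_nan = int(torch.isnan(t).sum())
        n_inf = int(torch.isinf(t).sum())
        if n_nan or n_inf:
            report[name] = {"nan": n_nan, "inf": n_inf}
    return report

# Usage with synthetic tensors standing in for real activations.
example = report_nonfinite({
    "a": torch.tensor([1.0, float("nan")]),
    "b": torch.ones(2),
    "c": torch.tensor([float("inf"), float("-inf")]),
})
print(example)
```

Calling this at a few points in the forward pass on the failing step narrows down where the first NaN or inf appears, which is usually more informative than the final loss value.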

AHEADer commented 2 years ago

Sorry for the late reply. Our data is somewhat sensitive, so I can't share it. I'll try to find time to run experiments on some public datasets and let you know.