AHEADer opened this issue 2 years ago
Hi, please refer to this issue for a solution: https://github.com/salesforce/BLIP/issues/76
Also, we recommend trying out our new library for training: https://github.com/salesforce/LAVIS
Thanks for your answer! I'll give it a try.
Hi, I tried what you suggested in the following way.

Old code:

```python
for b in range(bs):
    neg_idx = torch.multinomial(weights_t2i[b], 1).item()
    image_embeds_neg.append(image_embeds[neg_idx])
```

Current code:

```python
for b in range(bs):
    nan_idx = weights_t2i[b].isnan()
    weights_t2i[b][nan_idx] = 0.0001  # replace NaN weights with a small constant
    neg_idx = torch.multinomial(weights_t2i[b], 1).item()
    image_embeds_neg.append(image_embeds[neg_idx])
```
Now all losses become NaN. I may need your help with this. Thanks a lot!
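For reference, a more defensive version of this patch would also catch rows that are entirely NaN or sum to zero, rather than only giving the NaN entries a small positive mass. A minimal sketch, assuming the same `weights_t2i`, `bs`, `image_embeds`, and `image_embeds_neg` as in the snippet above, and a bs×bs weight matrix with the positive pair on the diagonal:

```python
import torch

# Sketch: zero out NaN weights, and fall back to uniform sampling
# when an entire row is unusable (all NaN or summing to zero).
for b in range(bs):
    w = torch.nan_to_num(weights_t2i[b], nan=0.0)  # NaN entries get no mass
    if w.sum() <= 0:                               # row is all NaN / all zero
        w = torch.ones_like(w)                     # uniform fallback
        w[b] = 0                                   # still exclude the positive pair
    neg_idx = torch.multinomial(w, 1).item()
    image_embeds_neg.append(image_embeds[neg_idx])
```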
It is hard for me to diagnose what is happening. Could you provide more information on the batch that leads to NaN loss?
Sorry for the late reply; our data is a little sensitive... I'll try to squeeze out some time to experiment on public datasets and let you know...
When I try to use models/model_pretrain.py to train on my datasets, a NaN error is raised like this:

```
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
```

I filtered out all NaN values in the torch.multinomial input, but a similar error still occurs, which suggests that all the values in the input are NaN. I think an overflow is happening here.
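For what it's worth, that exact error is easy to reproduce standalone once a row's probabilities sum to zero, which is what a fully-NaN row becomes after its NaN entries are zeroed out (a minimal repro, not taken from the actual training code):

```python
import torch

# A row whose probabilities sum to zero triggers the same RuntimeError.
w = torch.zeros(8)
torch.multinomial(w, 1)
# RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
```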
My pretrained weights are an ImageNet ViT + bert-base-chinese (the other weights are randomly initialized). I use AMP for training. I clamp the temperature to the range [0.1, 1], and I apply gradient clipping with the AdamW optimizer (max_norm=5.0, norm_type=2.0). I use warmup with a starting learning rate of 1e-9; the NaN error occurs when the learning rate reaches around 1e-5.

Can you give me some suggestions on this? Much appreciated!
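For context, the fix linked at the top of the thread appears to go in the other direction: rather than repairing rows right before sampling, it keeps the sampling weights strictly positive when they are computed, by adding a small epsilon after the softmax. A sketch of that kind of guard, with the extra precaution of doing the softmax in fp32 under AMP; the names `sim_t2i` and `bs` follow the ALBEF pretraining code and should be treated as assumptions here:

```python
import torch
import torch.nn.functional as F

# Sketch: keep the sampling weights strictly positive at computation time.
with torch.no_grad():
    sim = sim_t2i[:, :bs].float()               # do the softmax in fp32 under AMP
    weights_t2i = F.softmax(sim, dim=1) + 1e-4  # epsilon keeps every row positive
    weights_t2i.fill_diagonal_(0)               # never sample the positive pair
```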