salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method

RuntimeError: invalid multinomial distribution (sum of probabilities <= 0) #93

Open · yirutsai opened this issue 1 year ago

yirutsai commented 1 year ago

I'm still facing the same issue when setting batch_size to 1 or 2; however, batch_size=4 is too big for my GPU memory. How can I fix this? Thanks.

SilentMoebuta commented 1 year ago

I have the same issue when I run Pretrain.py.

LiJunnan1992 commented 1 year ago

Hi, you can try to add a small positive number to the weights as done here: https://github.com/salesforce/ALBEF/blob/fb384204472feab2a85bd4f5790d7889c31672c9/models/model_retrieval.py#L120

Batch_size=1 will not work because there needs to be at least 1 negative sample.
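For reference, a minimal sketch of that suggestion (a paraphrase, not the exact ALBEF code; the function name and the [bs, bs] similarity shape are assumed for illustration):

```python
import torch
import torch.nn.functional as F

def sample_hard_negatives(sim, eps=1e-4):
    """sim: [bs, bs] in-batch similarity matrix (assumed shape).
    Adding a small eps keeps every off-diagonal weight strictly positive,
    so torch.multinomial never sees an all-zero row."""
    weights = F.softmax(sim, dim=1) + eps  # the suggested small positive number
    weights.fill_diagonal_(0)              # never sample the positive pair itself
    return torch.multinomial(weights, 1).squeeze(1)  # one negative index per row
```

Note that with bs=1 the only entry in each row is the zeroed diagonal, so the row still sums to 0 even with the epsilon, hence the need for at least one negative sample.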

yirutsai commented 1 year ago

> Hi, you can try to add a small positive number to the weights as done here:
> https://github.com/salesforce/ALBEF/blob/fb384204472feab2a85bd4f5790d7889c31672c9/models/model_retrieval.py#L120
>
> Batch_size=1 will not work because there needs to be at least 1 negative sample.

Hi LiJunnan1992, thanks for answering. I have tried that method; however, it did not work for me. I tried adding 1e-4 and 1e-8 but still get the same error. (I reduced the image size to avoid OOM.)

SilentMoebuta commented 1 year ago

> Hi, you can try to add a small positive number to the weights as done here:
> https://github.com/salesforce/ALBEF/blob/fb384204472feab2a85bd4f5790d7889c31672c9/models/model_retrieval.py#L120
>
> Batch_size=1 will not work because there needs to be at least 1 negative sample.

Hi LiJunnan1992, it works for me when I set the batch size to 2. I had set it to 1 at first for fear of OOM. Thanks for your reply ;)

zhihuacc commented 1 year ago

> Hi, you can try to add a small positive number to the weights as done here:
> https://github.com/salesforce/ALBEF/blob/fb384204472feab2a85bd4f5790d7889c31672c9/models/model_retrieval.py#L120
>
> Batch_size=1 will not work because there needs to be at least 1 negative sample.

Hi, I'm facing the same issue with model_pretrain.py with batch size 512 on 8 GPUs. I added a small epsilon of 1e-4; the error became less frequent but can still occur. I'm wondering why the error can happen at all, since softmax() guarantees the probabilities sum to 1, right?
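For reference, a minimal repro of one plausible mechanism (an assumption, not confirmed in this thread): the softmax rows do sum to 1, but fill_diagonal_(0) removes the positive pair's weight before sampling, and with sufficiently peaked logits every off-diagonal probability underflows to exactly 0:

```python
import torch

# Rows of a softmax always sum to 1, but the hard-negative sampler
# zeroes the diagonal (the positive pair) before calling multinomial.
# If the logits are peaked enough, every off-diagonal probability
# underflows to exactly 0, and the zeroed row then sums to 0.
sim = torch.tensor([[60.0, -60.0],
                    [-60.0, 60.0]])
weights = torch.softmax(sim, dim=1)   # rows sum to 1.0
weights.fill_diagonal_(0)             # mask out the positive pair
print(weights.sum(dim=1))             # tensor([0., 0.])
torch.multinomial(weights[0], 1)      # RuntimeError: invalid multinomial
                                      # distribution (sum of probabilities <= 0)
```

This is why the suggested epsilon helps: adding it after the softmax keeps every off-diagonal weight strictly positive.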