salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

RuntimeError: invalid multinomial distribution (sum of probabilities <= 0) #131

Open HWH-2000 opened 11 months ago

HWH-2000 commented 11 months ago
    loss_mlm, loss_ita, loss_itm = model(image, text_input, alpha = alpha)
  File "/root/anaconda3/envs/deeplearning3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/deeplearning3.9/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/deeplearning3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/models/model_pretrain_id.py", line 156, in forward
    neg_idx = torch.multinomial(weights_t2i[b], 1).item()
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 771 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 773 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 774 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 772) of binary: /root/anaconda3/envs/deeplearning3.9/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/deeplearning3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/deeplearning3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/deeplearning3.9/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/root/anaconda3/envs/deeplearning3.9/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/root/anaconda3/envs/deeplearning3.9/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/root/anaconda3/envs/deeplearning3.9/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/root/anaconda3/envs/deeplearning3.9/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/deeplearning3.9/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

I ran into this error while running the pre-training code. Strangely, training runs normally at first, then stops at epoch 6 with the error above.
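The message says that the row passed to `torch.multinomial` (here `weights_t2i[b]`) had no positive probability mass left. A minimal sketch of a guard that could help narrow this down is shown below. It is written in plain Python so it runs without a GPU; `sanitize_weights`, `sample_negative`, and the `exclude` parameter are hypothetical names, not part of ALBEF or PyTorch. It assumes the row can become all zeros (or non-finite) late in training, which is only one possible cause.

```python
import math
import random


def sanitize_weights(weights, exclude=None):
    """Return (probs, used_fallback) for one row of sampling weights.

    Hypothetical helper: non-finite or negative entries are treated as 0.
    If nothing positive remains, fall back to a uniform distribution over
    every index except `exclude` (e.g. the positive pair's own index).
    """
    cleaned = [w if (math.isfinite(w) and w > 0) else 0.0 for w in weights]
    if exclude is not None:
        cleaned[exclude] = 0.0

    total = sum(cleaned)
    if total > 0:
        return [w / total for w in cleaned], False

    # Degenerate row: this is the situation that triggers
    # "invalid multinomial distribution (sum of probabilities <= 0)".
    candidates = [i for i in range(len(cleaned)) if i != exclude]
    if not candidates:
        raise ValueError("no valid negative index to sample from")
    p = 1.0 / len(candidates)
    return [p if i != exclude else 0.0 for i in range(len(cleaned))], True


def sample_negative(weights, exclude=None, rng=random):
    """Draw one index; also report whether the fallback was needed."""
    probs, used_fallback = sanitize_weights(weights, exclude)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, used_fallback


if __name__ == "__main__":
    # A healthy row and a degenerate (all-zero) row.
    print(sample_negative([0.0, 1.0, 0.0]))
    print(sample_negative([0.0, 0.0, 0.0], exclude=0))
```

Logging every step where `used_fallback` is true, together with the loss values at that step, would show whether the weights collapse gradually (for example because the similarity scores become NaN or underflow) or only for particular batches.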