salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

ValueError: Default process group has not been initialized, please make sure to call init_process_group. #139

Open nuistZPZ opened 1 month ago

nuistZPZ commented 1 month ago

Hello, I've run into a problem: if I don't use distributed training, I have to modify the source code. Setting distributed to False on the command line does not fix this class of error. How should I deal with the problems caused by the distributed-training code — is commenting out the relevant code the only way to solve it?
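A likely reason the command-line override has no effect: if the script declares the flag with `type=bool` (a common pattern in training scripts such as this one), argparse calls `bool("False")`, and any non-empty string is truthy, so the flag stays True. A minimal, self-contained reproduction of that pitfall (the flag name mirrors the issue; this is an illustration, not ALBEF's exact code):

```python
import argparse

parser = argparse.ArgumentParser()
# A bool-typed flag cannot be switched off from the command line:
# argparse applies bool() to the raw string, and bool("False") is True.
parser.add_argument('--distributed', default=True, type=bool)

args = parser.parse_args(['--distributed', 'False'])
print(args.distributed)  # still True, despite passing "False"
```

A common fix is to use an explicit `action='store_true'` / `action='store_false'` pair (or `argparse.BooleanOptionalAction` on Python 3.9+) instead of `type=bool`.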

----log-----

Traceback (most recent call last):
  File "F:\Projects\Multi Modal\ALBEF\Pretrain.py", line 203, in <module>
    main(args, config)
  File "F:\Projects\Multi Modal\ALBEF\Pretrain.py", line 175, in main
    dist.barrier()
  File "F:\anaconda3\envs\albef\lib\site-packages\torch\distributed\c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "F:\anaconda3\envs\albef\lib\site-packages\torch\distributed\distributed_c10d.py", line 3428, in barrier
    opts.device = _get_pg_default_device(group)
  File "F:\anaconda3\envs\albef\lib\site-packages\torch\distributed\distributed_c10d.py", line 644, in _get_pg_default_device
    group = group or _get_default_group()
  File "F:\anaconda3\envs\albef\lib\site-packages\torch\distributed\distributed_c10d.py", line 977, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
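For reference, two common workarounds for this error, sketched below (the helper names `barrier_safe` and `init_single_process_group` are illustrative, not part of ALBEF): either guard every collective call so it becomes a no-op when no process group exists, or initialize a trivial single-process group so the unmodified distributed code paths keep working:

```python
import os
import torch.distributed as dist

def barrier_safe():
    # dist.barrier() raises ValueError when no default process group
    # has been created; skip the call in single-process runs.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()

def init_single_process_group():
    # Alternative: create a world_size=1 process group so calls like
    # dist.barrier() succeed without code changes. The gloo backend
    # runs on CPU and on Windows, where NCCL is unavailable.
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group(backend='gloo', rank=0, world_size=1)
```

With the first approach, replacing the `dist.barrier()` at Pretrain.py line 175 with `barrier_safe()` avoids the ValueError; with the second, calling `init_single_process_group()` at startup lets the existing `dist.barrier()` run as a cheap no-op in a one-process group.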