hzwer / ECCV2022-RIFE

ECCV2022 - Real-Time Intermediate Flow Estimation for Video Frame Interpolation
MIT License

train error (nccl) torch.distributed #280

Open bis70 opened 2 years ago

bis70 commented 2 years ago

I think your work is great. Unfortunately, I have not managed to train the network. The following error message appears in the console:

(rife) PS C:\Users\C\PycharmProjects\RIFE> python -m torch.distributed.launch --nproc_per_node=1 train.py --world_size=1

Traceback (most recent call last):
  File "train.py", line 146, in <module>
    torch.distributed.init_process_group(backend="nccl", world_size=args.world_size)
AttributeError: module 'torch.distributed' has no attribute 'init_process_group'
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\rife\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\envs\rife\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\envs\rife\lib\site-packages\torch\distributed\launch.py", line 261, in <module>
    main()
  File "C:\ProgramData\Anaconda3\envs\rife\lib\site-packages\torch\distributed\launch.py", line 256, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['C:\ProgramData\Anaconda3\envs\rife\python.exe', '-u', 'train.py', '--local_rank=0', '--world_size=1']' returned non-zero exit status 1.

hzwer commented 2 years ago

Please check https://github.com/pytorch/examples/issues/467. It seems to be an issue with the combination of OS and PyTorch build.

bis70 commented 2 years ago

That link describes a macOS problem, but I am on Windows. Maybe I will find the cause somewhere else.
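
For anyone who hits the same AttributeError: it usually means the installed PyTorch build was compiled without distributed support, in which case `torch.distributed` exposes little more than `is_available()`. Separately, NCCL is not available on Windows, so `backend="nccl"` would not work there even on a build that does include distributed support. The sketch below is one way to check what the local build offers before calling `init_process_group`. The helper names `choose_backend` and `probe_torch_distributed` are made up for this illustration and are not part of RIFE or PyTorch, and whether this is the actual cause in the report above is an assumption.

```python
import importlib


def choose_backend(dist_available, nccl_available, gloo_available):
    """Return a backend name for init_process_group, or None if none is usable.

    Hypothetical helper: prefers NCCL when present, otherwise falls back to Gloo.
    """
    if not dist_available:
        return None
    if nccl_available:
        return "nccl"
    if gloo_available:
        return "gloo"
    return None


def probe_torch_distributed():
    """Inspect the installed PyTorch build and return (dist, nccl, gloo) flags."""
    try:
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        # PyTorch is not installed in this environment.
        return (False, False, False)
    if not dist.is_available():
        # Build without distributed support: init_process_group is missing.
        return (False, False, False)
    # Some older releases may lack these helpers; treat a missing helper
    # as "backend unavailable" (a conservative assumption).
    nccl = bool(getattr(dist, "is_nccl_available", lambda: False)())
    gloo = bool(getattr(dist, "is_gloo_available", lambda: False)())
    return (True, nccl, gloo)
```

Calling `choose_backend(*probe_torch_distributed())` gives either a backend name or `None`. If it returns `"gloo"`, passing that name instead of `"nccl"` to the `init_process_group` call at line 146 of `train.py` may let training start on Windows. If it returns `None`, the build itself lacks distributed support, and installing a PyTorch release that ships it for Windows is probably the first thing to try.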