Closed richardkxu closed 3 years ago
It seems your master port is already in use; you can change it as done in https://github.com/open-mmlab/mmaction2/blob/master/tools/slurm_train.sh#L3
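Picking a pseudo-random port can be sketched as follows (the range here mirrors the one used later in this thread; treat it as an assumption, not a fixed requirement):

```shell
# Pick a pseudo-random master port in [12000, 31999] to reduce the chance
# of clashing with another job's port. RANDOM is a bash builtin in [0, 32767].
MASTER_PORT=$((12000 + RANDOM % 20000))
echo "Using MASTER_PORT=$MASTER_PORT"
```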
Hi @dreamerlin ,
I got the same error when running the following cmd on 1 GPU on a single machine with 4 GPUs:
MASTER_PORT=$((12000 + $RANDOM % 20000));CUDA_VISIBLE_DEVICES=0; python mmaction2/tools/train.py configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_ucf101_rgb.py --validate --seed 0 --deterministic
I am not using distributed training or Slurm. I use the same command to run the irCSN script, and it works without setting MASTER_PORT. I am wondering if there is anything different between the r2plus1d runtime config and the irCSN runtime config?
Maybe you can use distributed training and set GPUS=1 with different ports.
That does not fix the error either. I don't think it is related to the port being used. I used the same cmd to run irCSN without any problem. I think there might be a bug in the r2plus1d dist implementation or runtime config.
Hi @richardkxu, this is caused by SyncBN: https://github.com/open-mmlab/mmaction2/blob/master/configs/_base_/models/r2plus1d_r34.py#L11. You can replace it with the usual BN. SyncBN assumes you are running in a distributed environment, so PyTorch tries to get the world rank, which triggers the error because the distributed process group has never been initialized.
It is hard to debug because the **new** config system buries the important info somewhere no one ever notices.
Thanks @innerlee! The error was fixed after replacing `norm_cfg=dict(type='SyncBN', requires_grad=True, eps=1e-3)` with `norm_cfg=dict(type='BN3d', requires_grad=True, eps=1e-3)`. Hopefully this fix can be included in the next release.
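For reference, the change in `configs/_base_/models/r2plus1d_r34.py` amounts to something like the following (a sketch showing only the relevant fragment; keys other than `norm_cfg` are abbreviated and may differ from the actual file):

```python
# configs/_base_/models/r2plus1d_r34.py (relevant fragment only)
model = dict(
    backbone=dict(
        # before: norm_cfg=dict(type='SyncBN', requires_grad=True, eps=1e-3),
        # after: plain 3D BatchNorm, which does not require an initialized
        # torch.distributed process group
        norm_cfg=dict(type='BN3d', requires_grad=True, eps=1e-3),
    ),
)
```

SyncBN remains the right choice for multi-GPU distributed runs, where batch statistics are synchronized across processes; BN3d avoids the world-rank lookup when training on a single, non-distributed GPU.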
@richardkxu I had the exact same error. Thanks a tonne, @innerlee, your solution solved it!
Describe the bug
Hi, I have encountered an error complaining that PyTorch distributed training has not been initialized properly when finetuning the r2plus1d model on the UCF101 dataset. I followed the "finetuning tutorial" to set up a new config file for UCF101. The only changes are to the dataset, and the same config works for finetuning irCSN on UCF101. I believe r2plus1d and irCSN share the same runtime config, and both use mmaction2/tools/train.py. I am really confused about where this error comes from and how to fix it. Thank you!
Reproduction
I did not modify any built-in r2plus1d config files and my config file is the following: