pavancm / CONTRIQUE

Official implementation for "Image Quality Assessment using Contrastive Learning"

About distributed training #7

Closed · yongZheng1723 closed this 2 years ago

yongZheng1723 commented 2 years ago

Is your distributed training setup single-machine multi-GPU or multi-machine multi-GPU? I ran into NCCL errors when training on two GPUs on the same machine.

pavancm commented 2 years ago

The code should work in either case. Please ensure all dependencies are installed at the correct versions. Distributed training can be set up in multiple ways in PyTorch; the code I shared creates a file named 'sharedfile' to connect the participating processes, since I pass init_method=cur_dir to dist.init_process_group in train.py. Please refer to the PyTorch distributed training documentation for more details.
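For readers unfamiliar with this rendezvous style, here is a minimal sketch of file-based initialization. It is illustrative only, not the exact train.py code: the 'sharedfile' path, rank, and world_size values are placeholders.

```python
# Minimal sketch of file-based process-group initialization (illustrative,
# not the repo's exact code). All processes must point at the same file.
import os
import torch.distributed as dist

def init_distributed(rank, world_size):
    shared_file = os.path.join(os.getcwd(), 'sharedfile')  # rendezvous file
    dist.init_process_group(
        backend='nccl',
        init_method='file://' + shared_file,
        world_size=world_size,
        rank=rank,
    )
```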

yongZheng1723 commented 2 years ago

Thank you for your reply! When I ran your code, there was an NCCL error. I then tried TCP initialization instead, but I still got an error. Was 'sharedfile' something you created locally, or is it generated automatically by the code?

pavancm commented 2 years ago

It will be created by the program. Make sure a file named 'sharedfile' does not already exist in your working directory before running; the program will fail if it does.
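One way to avoid that failure mode is a small guard before launching training. This is a hypothetical helper, not part of the original code; the 'sharedfile' path is assumed to be in the current working directory.

```python
# Hypothetical guard: remove a stale 'sharedfile' left over from a previous
# run so file-based rendezvous starts from a clean state.
import os

shared_file = os.path.join(os.getcwd(), 'sharedfile')
if os.path.exists(shared_file):
    os.remove(shared_file)
```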

yongZheng1723 commented 2 years ago

I ran the following commands on a machine with two 2080Ti GPUs:

CUDA_VISIBLE_DEVICES=0 python3 train.py --nodes 2 --nr 0 --batch_size 16 --lr 0.6 --epochs 25
CUDA_VISIBLE_DEVICES=1 python3 train.py --nodes 2 --nr 1 --batch_size 16 --lr 0.6 --epochs 25

But the following error occurs:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370116979/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled cuda error, NCCL version 2.7.8

I will try to solve this problem. Thanks again for your reply!
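When chasing this kind of unhandled NCCL/CUDA error, a common first step is to turn on NCCL's own logging and confirm that each process actually sees the GPU it expects before the process group is initialized. This snippet is a generic debugging sketch, not code from the repo; NCCL_DEBUG is a standard NCCL environment variable.

```python
# Generic debugging sketch (not from the repo): enable verbose NCCL logging
# and report which GPUs this process can see before init_process_group runs.
import os
import torch

os.environ['NCCL_DEBUG'] = 'INFO'   # must be set before NCCL initializes
print('visible GPUs:', torch.cuda.device_count())
if torch.cuda.is_available():
    print('device 0  :', torch.cuda.get_device_name(0))
```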