plemeri / InSPyReNet

Official PyTorch implementation of Revisiting Image Pyramid Structure for High Resolution Salient Object Detection (ACCV 2022)
MIT License

train problem #2

Closed cenchaojun closed 1 year ago

cenchaojun commented 2 years ago

Thank you for your nice work 👍. I tried to run this code, but I got the following error: RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=719709, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807565 milliseconds before timing out. ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1530483) of binary: /home/cenchaojun/.conda/envs/sod/bin/python3.8 What could be the reason for this? Best wishes to you ❤️
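(Side note: the Timeout(ms)=1800000 in the trace is the default NCCL watchdog limit of 30 minutes. If a collective legitimately takes longer, one common workaround is raising the timeout passed to init_process_group. A minimal sketch, not from this repo; the launcher is assumed to set RANK and the other rendezvous variables.)

```python
import datetime
import os

# Hedged sketch: the watchdog timeout in the error above is the NCCL
# default of 1800000 ms (30 min). init_process_group accepts a `timeout`
# argument, so it can be raised if the first all-reduce is simply slow.
timeout = datetime.timedelta(hours=2)
assert timeout / datetime.timedelta(milliseconds=1) > 1_800_000  # above the default

if "RANK" in os.environ:  # set by torch.distributed launchers; skipped when run standalone
    import torch.distributed as dist
    dist.init_process_group(backend="nccl", timeout=timeout)
```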

plemeri commented 2 years ago

Hello, thank you for your attention! I think you're trying to train with DDP. If you train with a single GPU, this won't happen, especially the error related to torch.distributed.multiprocessing. However, even though we do not officially support DDP, we did make our work compatible with it.

Please try the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 --master_port=$RANDOM run/Train.py --config InSPyReNet_SwinB.yaml --verbose --debug

Note that the above command is for 4 GPUs. Please adjust CUDA_VISIBLE_DEVICES and --nproc_per_node accordingly if your machine has more or fewer GPUs, or if you just want to train with a different number of GPUs. Also, please note that this is tested only with torch==1.8.1.
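(Side note: the two values above must be changed together; a small shell sketch, assuming a comma-separated CUDA_VISIBLE_DEVICES, that derives --nproc_per_node from the device list so they cannot drift apart.)

```shell
# Hedged sketch: count the devices in CUDA_VISIBLE_DEVICES and reuse the
# count as --nproc_per_node (arithmetic expansion strips wc's padding).
CUDA_VISIBLE_DEVICES=0,1,2,3
NPROC=$(( $(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l) ))
echo "nproc_per_node=$NPROC"
# Then launch exactly as above, substituting the derived value:
# CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES python -m torch.distributed.launch \
#   --nproc_per_node=$NPROC --master_port=$RANDOM run/Train.py \
#   --config InSPyReNet_SwinB.yaml --verbose --debug
```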

Please let me know if you have more questions.

cenchaojun commented 2 years ago

I tried this command, but it still does not work (screenshot attached).

plemeri commented 2 years ago

Can you tell me your PyTorch and cudatoolkit versions? I checked again on my machine and it works for me.

cenchaojun commented 2 years ago

My PyTorch version is 1.12.1. Although it is not 1.8.1, I think torch==1.12.1 should be backward compatible with code written for the lower version (screenshot attached).

plemeri commented 2 years ago

I guess so, but as far as I know, torch.distributed.launch will be deprecated. When I run DDP on my machine with my environment settings, it says: The module torch.distributed.launch is deprecated and going to be removed in future. Migrate to torch.distributed.run. You might be right about backward compatibility, but I would rather recommend creating another virtual environment with a lower torch/cudatoolkit version to verify whether it still fails.
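(Side note: one concrete difference between the two launchers is how the local rank reaches the script. torch.distributed.launch passes a --local_rank command-line argument, while torch.distributed.run / torchrun exports a LOCAL_RANK environment variable instead. A hedged sketch, with a hypothetical get_local_rank helper not taken from this repo, of reading both so a training script works under either launcher:)

```python
import argparse

def get_local_rank(argv, env):
    # --local_rank wins if given (torch.distributed.launch style);
    # otherwise fall back to the LOCAL_RANK env var (torchrun style).
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int,
                        default=int(env.get("LOCAL_RANK", 0)))
    return parser.parse_args(argv).local_rank

# launch-style: rank arrives on the command line
print(get_local_rank(["--local_rank", "2"], {}))  # 2
# torchrun-style: rank arrives via the environment
print(get_local_rank([], {"LOCAL_RANK": "3"}))    # 3
```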

Sorry that I cannot help you with the latest torch version.

cenchaojun commented 2 years ago

Thank you very much for your reply. I will try it with PyTorch 1.8.1. Thank you ❤️