sovit-123 / fasterrcnn-pytorch-training-pipeline

PyTorch Faster R-CNN Object Detection on Custom Dataset
MIT License

question about running distributed training on a single node with multiple GPUs #63

Open yochju opened 1 year ago

yochju commented 1 year ago

First, thanks for the great work.

When I run without distributed mode, it works fine: the "output/training/" folder is created and the training results are saved there. However, when I try to run in distributed mode, it does not work in my case:

  1. Using the argument --dist_url 'tcp://localhost:23456' or 'tcp://127.0.0.1:23456' => in this case it does not even create the output directory "output/training/".
  2. Using the default option --dist_url "env://", i.e. changing nothing in the default config => in this case the output directory is created correctly, but no training happens. It contains files such as "events.out.tfevents.1682197067.pid.0", "opt.yaml" and "train.log". "opt.yaml" does hold the training parameters, but "train.log" has size 0, i.e. it contains nothing. One thing I noticed is that in "opt.yaml" the value of gpu is 5, while our workstation has 10 GPUs. What could be the problem and how could it be solved? (A sketch of how the two init methods differ follows this list.)
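For context, here is a minimal sketch (not the repo's actual train.py; the init_distributed name is an assumption) of how the two init methods typically differ. With "env://" the launcher exports RANK, WORLD_SIZE and LOCAL_RANK for every spawned process, while with an explicit tcp:// URL every process must reach the same rendezvous address and port, so a mismatch there can leave the job stuck before it ever creates an output directory:

```python
import os
import torch
import torch.distributed as dist

def init_distributed(dist_url="env://"):
    """Illustrative init only, not the repo's implementation."""
    if dist_url == "env://":
        # torchrun / torch.distributed.launch --use_env set these variables
        # for every spawned process.
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        local_rank = int(os.environ["LOCAL_RANK"])
    else:
        # With an explicit tcp://host:port URL the rank and world size are
        # not taken from the environment automatically; they have to be
        # supplied some other way (e.g. command-line arguments).
        rank = int(os.environ.get("RANK", 0))
        world_size = int(os.environ.get("WORLD_SIZE", 1))
        local_rank = rank
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method=dist_url,
                            world_size=world_size, rank=rank)
    return rank, local_rank, world_size
```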
sovit-123 commented 1 year ago

Hello @yochju I am not sure how it will behave with 10 GPUs. I have only tested it with 2 GPUs, but I guess you can safely run training on 4 GPUs as well. Can you try the following? First, execute this command on the terminal: export CUDA_VISIBLE_DEVICES=0,1,2,3

Then the following command: python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py <rest of the command>
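Before running the full pipeline, it can also help to verify the launch itself with a tiny stand-alone script; the check_dist.py below is hypothetical (not part of this repo). Launched the same way (for example with --nproc_per_node=4 --use_env check_dist.py), it should print one line per GPU with local ranks 0 through 3:

```python
# check_dist.py -- hypothetical sanity check, not part of this repository.
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # MASTER_ADDR / MASTER_PORT are set by the launcher, so env:// works here.
    dist.init_process_group(backend="nccl")
    print(f"rank {rank}/{world_size}, local_rank {local_rank}, "
          f"device cuda:{torch.cuda.current_device()}")
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```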

yochju commented 1 year ago

The command that you recommend seems to work; it generated the output directory and the training results are saved. Previously I used --nproc_per_node=10 for 10 GPUs and --nproc_per_node=4 for 4 GPUs, respectively, which didn't work. So my question would be: why does --nproc_per_node=2 work, and how is the --nproc_per_node argument related to the number of GPUs?

sovit-123 commented 1 year ago

@yochju Actually, by mistake I gave --nproc_per_node=2 for 4 GPUs; it should have been 4. It's interesting that it works with --nproc_per_node=2. I will try to check this on a multi-GPU system. It may take some time to debug, though.
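For reference, on a single node the launcher spawns --nproc_per_node worker processes, the world size is nnodes times nproc_per_node, and each process is expected to pin itself to the GPU matching its LOCAL_RANK. A rough sketch of that arithmetic (illustrative only, not repo code):

```python
# Illustrative arithmetic for a single-node launch, not repo code.
nnodes = 1
nproc_per_node = 4                         # intended number of GPUs on this node
world_size = nnodes * nproc_per_node       # 4 processes in total
local_ranks = list(range(nproc_per_node))  # [0, 1, 2, 3], one per visible GPU

# With CUDA_VISIBLE_DEVICES=0,1,2,3 each process calls
# torch.cuda.set_device(local_rank), giving a 1:1 process-to-GPU mapping.
print(world_size, local_ranks)
```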

yochju commented 1 year ago

What is strange is that it seems to work only with --nproc_per_node=2, no matter how many CUDA_VISIBLE_DEVICES are available; I tried with 2, 4, 8 and 10 GPUs. For 4 GPUs, shouldn't all of the following be 4: the world size, --nproc_per_node and the number of workers? FYI, I attach the log below from a run with --nproc_per_node=4.

===========================================================
/home/juyongch/GPR/faster-rcnn/pyscript/train.py FAILED

Failures:
[1]:
  time       : 2023-04-23_19:09:39
  host       :
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 3928579)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2023-04-23_19:09:39
  host       :
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 3928578)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
===========================================================
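(The summary above only points to the elastic errors page instead of a real traceback. One way to surface the actual exception in this report is to wrap the training entry point with the record decorator from torch.distributed.elastic; a minimal sketch, assuming the script exposes a main() entry point:)

```python
# Sketch: wrap the training entry point so the elastic launcher records the
# real traceback instead of "traceback : To enable traceback see: ...".
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # ... the existing training code would go here ...
    raise RuntimeError("example failure so the recorded traceback shows up")

if __name__ == "__main__":
    main()
```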

sovit-123 commented 1 year ago

Trying to fix this.

yochju commented 1 year ago

Glad to hear that. How is it going?

sovit-123 commented 1 year ago

I tried a few experiments with 2 GPUs first just to ensure that there are no issues. However, it will be near this weekend before I can try running on more than two GPUs.