sovit-123 / fasterrcnn-pytorch-training-pipeline

PyTorch Faster R-CNN Object Detection on Custom Dataset
MIT License

question about running distributed training on a single node with multiple GPUs #63

Open yochju opened 1 year ago

yochju commented 1 year ago

First, thanks for the great work.

When I run without distributed mode, it works fine: the "output/training/" folder is created and the training results are saved there. However, when I try to run in distributed mode, it does not work in my case:

  1. Using the argument --dist_url 'tcp://localhost:23456' or 'tcp://127.0.0.1:23456' => in this case it does not even create the output directory "output/training/".
  2. Using the default option --dist_url "env://", i.e. changing nothing in the default config => in this case the output directory is created correctly, but no training happens. It contains files such as "events.out.tfevents.1682197067.pid.0", "opt.yaml" and "train.log". "opt.yaml" does hold the training parameters, but "train.log" has size 0, i.e. it contains nothing. One thing I noticed is that in "opt.yaml" the value of gpu is 5, while our workstation has 10 GPUs. What could be the problem and how could it be solved? (A sketch of how the two init methods differ follows this list.)
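For context, here is a minimal sketch (not the repo's actual train.py; the init_distributed name is an assumption) of how the two init methods typically differ. With "env://" the launcher exports RANK, WORLD_SIZE and LOCAL_RANK for every spawned process, while with an explicit tcp:// URL every process must reach the same rendezvous address and port, so a mismatch there can leave the job stuck before it ever creates an output directory:

```python
import os
import torch
import torch.distributed as dist

def init_distributed(dist_url="env://"):
    """Illustrative init only, not the repo's implementation."""
    if dist_url == "env://":
        # torchrun / torch.distributed.launch --use_env set these variables
        # for every spawned process.
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        local_rank = int(os.environ["LOCAL_RANK"])
    else:
        # With an explicit tcp://host:port URL the rank and world size are
        # not taken from the environment automatically; they have to be
        # supplied some other way (e.g. command-line arguments).
        rank = int(os.environ.get("RANK", 0))
        world_size = int(os.environ.get("WORLD_SIZE", 1))
        local_rank = rank
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method=dist_url,
                            world_size=world_size, rank=rank)
    return rank, local_rank, world_size
```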
sovit-123 commented 1 year ago

Hello @yochju I am not sure how it will behave with 10 GPUs. I have only tested it with 2 GPUs, but I guess you can safely run training on 4 GPUs as well. Can you try the following? First, execute this command on the terminal: export CUDA_VISIBLE_DEVICES=0,1,2,3

Then the following command: python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py <rest of the command>
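Before running the full pipeline, it can also help to verify the launch itself with a tiny stand-alone script; the check_dist.py below is hypothetical (not part of this repo). Launched the same way (for example with --nproc_per_node=4 --use_env check_dist.py), it should print one line per GPU with local ranks 0 through 3:

```python
# check_dist.py -- hypothetical sanity check, not part of this repository.
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # MASTER_ADDR / MASTER_PORT are set by the launcher, so env:// works here.
    dist.init_process_group(backend="nccl")
    print(f"rank {rank}/{world_size}, local_rank {local_rank}, "
          f"device cuda:{torch.cuda.current_device()}")
    dist.barrier()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```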

yochju commented 1 year ago

The command that you recommend seems to work; it generated the output directory and the training results are saved. Previously I used --nproc_per_node=10 for 10 GPUs and --nproc_per_node=4 for 4 GPUs, respectively, which didn't work. So my question would be: why does --nproc_per_node=2 work, and how is the --nproc_per_node argument related to the number of GPUs?

sovit-123 commented 1 year ago

@yochju Actually, by mistake I gave --nproc_per_node=2 for 4 GPUs; it should have been 4. It's interesting that it works with --nproc_per_node=2. I will try to check this on a multi-GPU system. It may take some time to debug, though.
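For reference, on a single node the launcher spawns --nproc_per_node worker processes, the world size is nnodes times nproc_per_node, and each process is expected to pin itself to the GPU matching its LOCAL_RANK. A rough sketch of that arithmetic (illustrative only, not repo code):

```python
# Illustrative arithmetic for a single-node launch, not repo code.
nnodes = 1
nproc_per_node = 4                         # intended number of GPUs on this node
world_size = nnodes * nproc_per_node       # 4 processes in total
local_ranks = list(range(nproc_per_node))  # [0, 1, 2, 3], one per visible GPU

# With CUDA_VISIBLE_DEVICES=0,1,2,3 each process calls
# torch.cuda.set_device(local_rank), giving a 1:1 process-to-GPU mapping.
print(world_size, local_ranks)
```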

yochju commented 1 year ago

What is strange is that it seems to work only with --nproc_per_node=2, no matter how many CUDA_VISIBLE_DEVICES are available; I tried with 2, 4, 8 and 10 GPUs. For 4 GPUs, shouldn't all of the following be 4: the world size, --nproc_per_node and the number of workers? FYI, I attach the log below from a run with --nproc_per_node=4.

===========================================================
/home/juyongch/GPR/faster-rcnn/pyscript/train.py FAILED

Failures:
[1]:
  time       : 2023-04-23_19:09:39
  host       :
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 3928579)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2023-04-23_19:09:39
  host       :
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 3928578)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
===========================================================
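(The summary above only points to the elastic errors page instead of a real traceback. One way to surface the actual exception in this report is to wrap the training entry point with the record decorator from torch.distributed.elastic; a minimal sketch, assuming the script exposes a main() entry point:)

```python
# Sketch: wrap the training entry point so the elastic launcher records the
# real traceback instead of "traceback : To enable traceback see: ...".
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # ... the existing training code would go here ...
    raise RuntimeError("example failure so the recorded traceback shows up")

if __name__ == "__main__":
    main()
```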

sovit-123 commented 1 year ago

Trying to fix this.

yochju commented 1 year ago

Glad to hear that. How is it going?

sovit-123 commented 1 year ago

I tried a few experiments with 2 GPUs first just to ensure that there are no issues. However, it will be near this weekend before I can try running on more than two GPUs.