yochju opened this issue 1 year ago
Hello @yochju
I am not sure how it will behave with 10 GPUs. I have only tested it with 2 GPUs, but I guess you can safely run training on 4 GPUs as well. Can you try the following?
First, execute this command on the terminal:
export CUDA_VISIBLE_DEVICES=0,1,2,3
Then the following command:
python -m torch.distributed.launch --nproc_per_node=2 --use_env train.py <rest of the command>
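(Side note: on newer PyTorch versions torch.distributed.launch is deprecated in favor of torchrun, which sets --use_env by default. If torchrun is available in your environment, the equivalent launch would be roughly:

torchrun --nproc_per_node=2 train.py <rest of the command>
)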
The command that you recommend seems to work: it generated the output directory and the training results are saved there. Previously I used --nproc_per_node=10 for 10 GPUs and --nproc_per_node=4 for 4 GPUs, respectively, which didn't work. So my question is: why does --nproc_per_node=2 work, and how is --nproc_per_node related to the number of GPUs?
@yochju
Actually, by mistake I gave --nproc_per_node=2 for 4 GPUs; it should have been 4. It's interesting that it works with --nproc_per_node=2. I will try to check this on a multi-GPU system. It may take some time to debug, though.
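For context, --nproc_per_node is the number of worker processes the launcher spawns on the node, and it should normally match the number of GPUs you want to use there; each process then selects its GPU from its local rank. A minimal sketch of that pattern, assuming the script follows the usual env:// setup shown in your log (init_distributed here is just an illustrative helper name, not necessarily what train.py uses):

import os
import torch
import torch.distributed as dist

def init_distributed():
    # torchrun / torch.distributed.launch --use_env export these variables per process
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])  # nnodes * nproc_per_node
    torch.cuda.set_device(local_rank)           # pin this process to one visible GPU
    dist.init_process_group(backend="nccl", init_method="env://")
    return local_rank, world_size

So with CUDA_VISIBLE_DEVICES=0,1,2,3 and --nproc_per_node=2, only GPUs 0 and 1 end up being used, which is probably why that combination still runs.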
What is strange is that, seemingly, it only works with --nproc_per_node=2, regardless of how many CUDA_VISIBLE_DEVICES are available; I tried with 2, 4, 8, and 10 GPUs.
For 4 GPUs, shouldn't all of the following be 4: the world size, --nproc_per_node, and workers?
FYI, I attach the log below from when I tried to run with --nproc_per_node=4:
/home/juyongch/anaconda3/envs/faster_rcnn_37/lib/python3.7/site-packages/torch/distributed/launch.py:188: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  FutureWarning,
WARNING:torch.distributed.run:Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
| distributed init (rank 2): env://
| distributed init (rank 3): env://
| distributed init (rank 1): env://
| distributed init (rank 0): env://
device cuda
Traceback (most recent call last):
  File "/home/juyongch/GPR/faster-rcnn/pyscript/train.py", line 540, in <module>
/home/juyongch/anaconda3/envs/faster_rcnn_37/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=FasterRCNN_ResNet50_FPN_Weights.COCO_V1. You can also use weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
/home/juyongch/anaconda3/envs/faster_rcnn_37/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=FasterRCNN_ResNet50_FPN_Weights.COCO_V1. You can also use weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3928580 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3928581 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3928578) of binary: /home/juyongch/anaconda3/envs/faster_rcnn_37/bin/python3
Traceback (most recent call last):
  File "/home/juyongch/anaconda3/envs/faster_rcnn_37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/juyongch/anaconda3/envs/faster_rcnn_37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/juyongch/anaconda3/envs/faster_rcnn_37/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
============================================================
/home/juyongch/GPR/faster-rcnn/pyscript/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time       : 2023-04-23_19:09:39
  host       :
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 3928579)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-04-23_19:09:39
  host       :
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 3928578)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
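On the arguments question: the world size is not passed directly here. The launcher computes it as nnodes x nproc_per_node and exports it as WORLD_SIZE, so on a single node --nproc_per_node=4 already gives a world size of 4. If workers is the usual DataLoader num_workers argument, it is per training process and does not have to equal the GPU count. For example, for 4 GPUs on one node:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py <rest of the command>
# -> 4 training processes, each seeing WORLD_SIZE=4 and its own LOCAL_RANK in 0-3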
Trying to fix this.
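In the meantime, one thing that could surface the real error: the summary above only reports exitcode 1 without the worker traceback. Per the elastic errors page linked in the log, wrapping the script's entry point with the record decorator should make the child's traceback appear in that summary. A rough sketch, assuming train.py has a main-style entry function (the name is illustrative):

# in train.py
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing argument parsing and training code

if __name__ == "__main__":
    main()

That should at least tell us which line in train.py fails when --nproc_per_node=4.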
Glad to hear that. How is it going?
I tried a few experiments with 2 GPUs first, just to ensure that there are no issues. However, it will be near this weekend before I can try running on more than two GPUs.
First, thanks for the great work.
When I run without distributed mode, it works fine: it creates the "output/training/" folder and saves the training results there.
However, when I try to run in distributed mode, it does not work in my case: