I tested object detection with the NVIDIA reference code. The only difference was that I used two nodes instead of eight, but the following error occurred:
(the identical tracebacks from the two worker processes were printed interleaved; cleaned up, the traceback reads:)

Traceback (most recent call last):
  File "tools/train_net.py", line 328, in <module>
    main()
  File "tools/train_net.py", line 319, in main
    model = train(cfg, random_number_generator, args.local_rank, args.distributed, args, args.fp16)
  File "tools/train_net.py", line 173, in train
    random_number_generator,
  File "/workspace/object_detection/maskrcnn_benchmark/engine/trainer.py", line 190, in do_train
    use_distributed, use_amp=arguments["use_amp"]
  File "/workspace/object_detection/maskrcnn_benchmark/engine/trainer.py", line 105, in train_one_epoch
    scaled_losses.backward()
  File "/opt/conda/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/opt.py", line 50, in scale_loss
    self._optimizer.param_groups, loss_scale)
  File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 38, in unscale_and_update
    / scale)
ZeroDivisionError: float division by zero
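From reading apex's scaler, my understanding (which may be wrong) is that the dynamic loss scaler halves loss_scale every time the gradients overflow, so if every step overflows (e.g. if the loss diverges when the 8-node hyperparameters are run on 2 nodes), the scale eventually underflows to 0.0 and the `/ scale` unscaling step raises exactly this ZeroDivisionError. A toy sketch of that failure mode (not apex's actual code, just my mental model):

```python
# Toy model of dynamic loss scaling (assumption: backoff-by-halving on
# overflow, as apex.amp's scaler does, with no lower bound on the scale).
class ToyLossScaler:
    def __init__(self, init_scale=2.0**15):
        self.scale = init_scale

    def update(self, found_overflow):
        if found_overflow:
            # Backoff: halving a float repeatedly eventually underflows to 0.0.
            self.scale /= 2.0

def unscale(grad, scale):
    # Raises ZeroDivisionError("float division by zero") once scale == 0.0,
    # which matches the error in the traceback above.
    return grad / scale

scaler = ToyLossScaler()
for _ in range(2000):              # persistent NaN/Inf gradients every step
    scaler.update(found_overflow=True)
print(scaler.scale)                # 0.0 after enough halvings
```

If that reading is right, the division by zero is only a symptom; the real problem would be that training diverges under my 2-node setup.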
My config_DGX1_MULTI.sh is as follows:
## DL params
EXTRA_PARAMS="--min_bbox_map 0.377 --min_mask_map 0.339"
EXTRA_CONFIG=(
    "SOLVER.BASE_LR" "0.16"
    "SOLVER.MAX_ITER" "40000"
    "SOLVER.WARMUP_FACTOR" "0.000256"
    "SOLVER.WARMUP_ITERS" "625"
    "SOLVER.WARMUP_METHOD" "mlperf_linear"
    "SOLVER.STEPS" "(9000, 12000)"
    "DATALOADER.IMAGES_PER_BATCH_TRAIN" "2"
    "MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN" "2000"
)
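For reference, here is my back-of-the-envelope arithmetic for how these values might need to scale down from 8 nodes to 2, assuming the linear scaling rule for SGD (scale the learning rate with the global batch size and the schedule inversely). This is just my guess, not a verified config:

```python
# Assumption: linear scaling rule. Global batch = nodes * GPUs/node * images/GPU.
images_per_gpu = 2            # DATALOADER.IMAGES_PER_BATCH_TRAIN
gpus_per_node = 8             # DGXNGPU
ref_nodes, my_nodes = 8, 2    # reference config vs. my run

ref_batch = ref_nodes * gpus_per_node * images_per_gpu   # 128 images
my_batch = my_nodes * gpus_per_node * images_per_gpu     # 32 images
ratio = my_batch / ref_batch                             # 0.25

base_lr = 0.16 * ratio                         # SOLVER.BASE_LR  -> 0.04
max_iter = int(40000 / ratio)                  # SOLVER.MAX_ITER -> 160000
steps = tuple(int(s / ratio) for s in (9000, 12000))  # -> (36000, 48000)
print(base_lr, max_iter, steps)
```

If keeping BASE_LR at the 8-node value of 0.16 with a quarter of the batch size can make training diverge, that might also explain the loss-scale collapse above, but I'd like confirmation.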
## System run params
DGXNNODES=2
DGXSYSTEM=DGX1_multi
WALLTIME=12:00:00
## System config params
DGXNGPU=8
DGXSOCKETCORES=14
DGXHT=2  # HT is on is 2, HT off is 1

I suspect that EXTRA_CONFIG needs to be modified. Can someone give me some guidance? Thanks in advance!