mlcommons / training_results_v0.5

This repository contains the results and code for the MLPerf™ Training v0.5 benchmark.
https://mlcommons.org/en/training-normal-05/
Apache License 2.0

【object_detection】ZeroDivisionError: float division by zero #6

Open liuyanfeng543 opened 5 years ago

liuyanfeng543 commented 5 years ago

I tested object detection with the NVIDIA code. The only difference is that I used two nodes instead of eight, but the following error occurred. The log interleaves output from more than one process; one process's traceback, untangled, reads:

    Traceback (most recent call last):
      File "tools/train_net.py", line 328, in <module>
        main()
      File "tools/train_net.py", line 319, in main
        model = train(cfg, random_number_generator, args.local_rank, args.distributed, args, args.fp16)
      File "tools/train_net.py", line 173, in train
        random_number_generator,
      File "/workspace/object_detection/maskrcnn_benchmark/engine/trainer.py", line 190, in do_train
        use_distributed, use_amp=arguments["use_amp"]
      File "/workspace/object_detection/maskrcnn_benchmark/engine/trainer.py", line 105, in train_one_epoch
        scaled_losses.backward()
      File "/opt/conda/lib/python3.6/contextlib.py", line 88, in __exit__
        next(self.gen)
      File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/opt.py", line 50, in scale_loss
        self._optimizer.param_groups, loss_scale)
      File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 38, in unscale_and_update
        1. / scale)
    ZeroDivisionError: float division by zero
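The last frame divides by `scale`, so the error suggests that the dynamic loss scale had reached exactly 0.0. One way this can happen, sketched below under assumptions, is that every iteration reports a gradient overflow (for example because the loss has already diverged to NaN) and the scale is halved each time until it underflows. This is not apex's actual code: `simulate_loss_scale`, `unscale`, and the `min_scale` floor are hypothetical names used only for illustration, and the halve-on-overflow policy is an assumption about how the scaler behaves.

```python
def simulate_loss_scale(overflow_flags, initial_scale=2.0 ** 16, min_scale=None):
    """Track a dynamic loss scale across steps (illustrative only).

    overflow_flags: iterable of bools, True when a step saw inf/NaN gradients.
    min_scale: optional floor; hypothetical guard, not an apex parameter here.
    """
    scale = initial_scale
    for overflowed in overflow_flags:
        if overflowed:
            # Assumed policy: halve the scale whenever gradients overflow.
            scale /= 2.0
            if min_scale is not None:
                scale = max(scale, min_scale)
    return scale


def unscale(grad, scale):
    """Mirror the failing expression: multiply the gradient by 1. / scale."""
    return grad * (1.0 / scale)


if __name__ == "__main__":
    # 1200 consecutive overflows push a Python float below the smallest subnormal.
    collapsed = simulate_loss_scale([True] * 1200)
    print("scale without a floor:", collapsed)
    try:
        unscale(1.0, collapsed)
    except ZeroDivisionError as exc:
        print("reproduced:", exc)

    # With a floor the scale stays usable, although training may still be diverging.
    print("scale with a floor:", simulate_loss_scale([True] * 1200, min_scale=1.0))
```

If this is what happened, the division by zero is a symptom rather than the cause: the thing to investigate is why every step overflows, and a learning rate that is too high for the smaller global batch is one candidate.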

My config_DGX1_MULTI.sh is as follows:

    ## DL params
    EXTRA_PARAMS="--min_bbox_map 0.377 --min_mask_map 0.339"
    EXTRA_CONFIG=(
      "SOLVER.BASE_LR" "0.16"
      "SOLVER.MAX_ITER" "40000"
      "SOLVER.WARMUP_FACTOR" "0.000256"
      "SOLVER.WARMUP_ITERS" "625"
      "SOLVER.WARMUP_METHOD" "mlperf_linear"
      "SOLVER.STEPS" "(9000, 12000)"
      "DATALOADER.IMAGES_PER_BATCH_TRAIN" "2"
      "MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN" "2000"
    )

    ## System run parms
    DGXNNODES=2
    DGXSYSTEM=DGX1_multi
    WALLTIME=12:00:00

    ## System config params
    DGXNGPU=8
    DGXSOCKETCORES=14
    DGXHT=2 # HT is on is 2, HT off is 1

I suspect that EXTRA_CONFIG needs to be modified. Can someone give me some guidance? Thanks in advance!
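One adjustment that might be worth trying is the common linear scaling heuristic: keep the learning rate proportional to the global batch size. The sketch below is only a starting point, not a confirmed fix. It assumes that `SOLVER.BASE_LR` of 0.16 was tuned for the eight-node run and that `DATALOADER.IMAGES_PER_BATCH_TRAIN` is a per-GPU count; neither is verified here, and `scale_lr_for_nodes` is a hypothetical helper.

```python
def scale_lr_for_nodes(ref_lr, ref_nodes, new_nodes, gpus_per_node=8, images_per_gpu=2):
    """Scale a learning rate linearly with the global batch size.

    Assumes the per-GPU batch and the GPUs per node stay the same, so the
    global batch changes only with the node count (an assumption, see above).
    """
    ref_batch = ref_nodes * gpus_per_node * images_per_gpu
    new_batch = new_nodes * gpus_per_node * images_per_gpu
    return ref_lr * new_batch / ref_batch


if __name__ == "__main__":
    # Values taken from the config above: DGXNGPU=8, IMAGES_PER_BATCH_TRAIN=2.
    print("candidate SOLVER.BASE_LR for 2 nodes:", scale_lr_for_nodes(0.16, 8, 2))
```

Under these assumptions the global batch drops from 128 to 32 images, which gives a candidate base learning rate of roughly 0.04. Iteration-based settings such as `SOLVER.MAX_ITER`, `SOLVER.STEPS`, and `SOLVER.WARMUP_ITERS` are often stretched by the inverse factor when the batch shrinks, but that also needs to be checked against the benchmark's convergence targets.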