mlcommons / training_results_v0.5

This repository contains the results and code for the MLPerf™ Training v0.5 benchmark.
https://mlcommons.org/en/training-normal-05/
Apache License 2.0

Will Object Detection benchmark run on Tesla P100s? or DGX-1 & DGX-2 only? #8

Open dfeddema opened 5 years ago

dfeddema commented 5 years ago

I am wondering if anyone has run the object detection benchmark with PyTorch on Tesla P100s. Here's a link to the code I want to run on P100s, but the config files are for DGX-1 and DGX-2 only. Are there simple config changes that will allow this to run on a Tesla P100? https://github.com/mlperf/results/tree/master/v0.5.0/nvidia/submission/code/object_detection/pytorch

aakashkardam commented 5 years ago

If you just remove -m torch.distributed.launch --nproc_per_node $SLURM_NTASKS_PER_NODE $MULTI_NODE from the run_and_time.sh script and rebuild the Docker image, I think you should be able to run the same benchmark on a single GPU without changing anything in the config files.
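Concretely, the edit amounts to something like this (a sketch only; tools/train_net.py and "${TRAIN_ARGS[@]}" are placeholders for whatever training script and arguments your copy of run_and_time.sh actually invokes):

# Before: multi-GPU launch through the PyTorch distributed launcher
python -m torch.distributed.launch --nproc_per_node $SLURM_NTASKS_PER_NODE $MULTI_NODE tools/train_net.py "${TRAIN_ARGS[@]}"

# After: plain single-process, single-GPU launch
python tools/train_net.py "${TRAIN_ARGS[@]}"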

I was able to get it to start training on a single Tesla V100 (1 of the 8 GPUs on a DGX-1) using the same trick, and maybe you can do the same on the Tesla P100s. You might want to reduce the batch size, since the Tesla P100 has less memory (~16 GB) than the Tesla V100 (~32 GB).
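If memory is the limit, the knob to turn is the per-batch image count. In an upstream maskrcnn-benchmark-style config, the training batch size is the global SOLVER.IMS_PER_BATCH (split across however many GPUs you launch) and the evaluation batch size is TEST.IMS_PER_BATCH. A minimal sketch of overriding both on the command line, assuming a train_net.py-style entry point and a placeholder config path:

python tools/train_net.py --config-file path/to/your_config.yaml SOLVER.IMS_PER_BATCH 2 TEST.IMS_PER_BATCH 1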

dfeddema commented 5 years ago

I made the change suggested above and ran it on a single V100. It failed with "loss: nan (nan) loss_classifier: nan (nan) loss_box_reg: nan (nan)", and soon after the NaNs it hit a fatal "ZeroDivisionError: float division by zero".

Would you explain a bit more about reducing the batch size? I saw several config parameters involving "batch" sizing.

Thanks!

aakashkardam commented 5 years ago

I could reproduce the error. NaN losses suggest that the model is diverging, and multiple factors could be in play, such as too large a learning rate. The default learning rate for this problem when using 8 GPUs is 0.04. I tried a few things, like reducing the batch size to 2 and the learning rate to 0.0025. Take a look at https://github.com/facebookresearch/maskrcnn-benchmark/issues/295

Try following the suggestions here https://github.com/facebookresearch/maskrcnn-benchmark#single-gpu-training
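For context, the single-GPU recipe in that README section scales the multi-GPU schedule down: a smaller per-batch image count and learning rate, with correspondingly more iterations. The command it gives looks roughly like the following (the paths are placeholders):

python /path_to_maskrcnn_benchmark/tools/train_net.py \
    --config-file "/path/to/config/file.yaml" \
    SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025 \
    SOLVER.MAX_ITER 720000 SOLVER.STEPS "(480000, 640000)" \
    TEST.IMS_PER_BATCH 1 MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000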

I am also attaching the config file I used to get it to run on a single GPU. It has been running for more than 2 days now and still hasn't finished, but I haven't seen any NaN losses either.

#!/bin/bash

EXTRA_CONFIG=(
  SOLVER.BASE_LR 0.0025
  SOLVER.MAX_ITER 720000
  SOLVER.STEPS "(480000, 640000)"
  MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000
)

DGXNNODES=1
DGXSYSTEM=DGX1
WALLTIME=12:00:00

DGXNGPU=8
DGXSOCKETCORES=20
DGXHT=2  # HT is on is 2, HT off is 1
DGXIBDEVICES=''

UPDATE: It finished training without any NaN errors.