I do not test the benchmark script on multi-node anymore. I would suggest the "official" ResNet50 example, which we are now testing nightly: https://github.com/tensorflow/models/tree/master/official/resnet
That is a TF 1.x example that also works on TF 2.0. It may move to a legacy example directory in the future. I have been moving people toward our official example/reference models because there are people interested in answering questions about that code. I still run nightly single-node tests for tf_cnn_benchmarks, but only because I have not bothered to stop the job; beyond that, getting help debugging it is difficult. If you have issues with the "official" ResNet50, please contact me directly (@tfboyd) and I will try to get you answers.
Hi, I get a NaN loss while training a ResNet50 model with 1 V100 per node on 2 nodes. I am also using nv_peer_memory and GDR. The strategy I use is parameter server. My command line is:
python3 tf_cnn_benchmarks.py \
  --ps_hosts=${PS1},${PS2} \
  --worker_hosts=${WORKER1},${WORKER2} \
  --controller_host=${CONTROLLER_HOST} \
  --job_name=worker \
  --variable_update=${VARIABLE_UPDATE} \
  --local_parameter_device=cpu \
  --use_fp16 \
  --batch_size=${BATCH_SIZE_PER_GPU} \
  --force_gpu_compatible \
  --num_gpus=1 \
  --model=${TRAIN_MODEL} \
  --task_index=0 \
  --server_protocol=${PROTOCOL} \
  --all_reduce_spec=${ALL_REDUCE_ALG}
and the result is as follows. Then, when I use 2 V100s per node, the result is as follows: the loss becomes normal. So I wonder why the loss becomes NaN with 1 GPU per node. Can anyone explain these results, or figure out at which step the loss becomes NaN?
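One way to narrow down where the NaN first appears is to assert numerics in the graph. Below is a minimal sketch, assuming a TF 1.x graph-mode setup like the one tf_cnn_benchmarks uses; the toy model and tensor names (`features`, `labels`, `logits`) are hypothetical placeholders, not code from the benchmark script. The same idea (wrapping the loss in `tf.debugging.check_numerics`, or adding `tf.add_check_numerics_ops()` to the fetches) could be applied by editing the training graph you actually run.

import tensorflow as tf

# Hypothetical toy graph; replace with the real model's tensors.
features = tf.placeholder(tf.float32, [None, 8], name="features")
labels = tf.placeholder(tf.int64, [None], name="labels")

hidden = tf.layers.dense(features, 16, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 10)

# Assert the loss itself is finite; a NaN/Inf raises InvalidArgumentError
# with the message below, so you know the loss computation produced it.
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
loss = tf.debugging.check_numerics(loss, "loss became NaN or Inf")

train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Optionally assert every float tensor in the graph; the first op that
# produces NaN/Inf is named in the error. (Does not cover ops inside
# tf.while_loop / control-flow contexts.)
check_all_op = tf.add_check_numerics_ops()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Feed real batches here; running check_all_op alongside train_op
    # pinpoints the step and op where the NaN first shows up.
    # sess.run([train_op, check_all_op],
    #          feed_dict={features: batch_x, labels: batch_y})

Separately, since the run uses --use_fp16: a common cause of NaN in fp16 training is an unsuitable loss scale, so it may be worth checking whatever loss-scaling options your version of tf_cnn_benchmarks exposes (or trying the same run without --use_fp16 to see if the NaN disappears).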