Intel® AI Reference Models: contains Intel optimizations for running deep learning workloads on Intel® Xeon® Scalable processors and Intel® Data Center GPUs
The variable "end_training" in Bert_Large training is wrongly used. #170
In the code below, the variable "end_training" is defined as a boolean flag that decides when training should end.
https://github.com/IntelAI/models/blob/cdd842a33eb9d402ff18bfb79bd106ae132a8e99/models/language_modeling/pytorch/bert_large/training/gpu/run_pretrain_mlperf.py#L838
In the code below, which calculates the time taken by one training iteration, the variable "end_training" is wrongly reused to record the iteration's end time.
https://github.com/IntelAI/models/blob/cdd842a33eb9d402ff18bfb79bd106ae132a8e99/models/language_modeling/pytorch/bert_large/training/gpu/run_pretrain_mlperf.py#L1006
Line 1006 therefore sets "end_training" to a non-zero value. As a result, once the first data file has been used for training, the check below evaluates to true, training exits, and the next data file is never reached.
https://github.com/IntelAI/models/blob/cdd842a33eb9d402ff18bfb79bd106ae132a8e99/models/language_modeling/pytorch/bert_large/training/gpu/run_pretrain_mlperf.py#L1079