The variable "end_training" in Bert_Large training is wrongly used.

intel / ai-reference-models

Intel® AI Reference Models: contains Intel optimizations for running deep learning workloads on Intel® Xeon® Scalable processors and Intel® Data Center GPUs

Apache License 2.0

683 stars, 220 forks

The variable "end_training" in Bert_Large training is wrongly used. #170

Open taotod opened 8 months ago

taotod commented 8 months ago

At the line linked below, the variable "end_training" is defined as a boolean that decides when to end training.

https://github.com/IntelAI/models/blob/cdd842a33eb9d402ff18bfb79bd106ae132a8e99/models/language_modeling/pytorch/bert_large/training/gpu/run_pretrain_mlperf.py#L838

At the line linked below, which computes the time taken by one training iteration, "end_training" is wrongly reused to record the timestamp at which the iteration ends. https://github.com/IntelAI/models/blob/cdd842a33eb9d402ff18bfb79bd106ae132a8e99/models/language_modeling/pytorch/bert_large/training/gpu/run_pretrain_mlperf.py#L1006

Line 1006 therefore assigns "end_training" a non-zero value, which evaluates as true. As a result, once the first data file has been used for training, the check linked below exits the training loop, and training never moves on to the next data file. https://github.com/IntelAI/models/blob/cdd842a33eb9d402ff18bfb79bd106ae132a8e99/models/language_modeling/pytorch/bert_large/training/gpu/run_pretrain_mlperf.py#L1079

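To make the failure mode concrete, here is a minimal standalone sketch of the pattern described above. It is not the code from run_pretrain_mlperf.py: the function names, the `steps_per_file` and `max_steps` parameters, and the loop structure are all hypothetical, and only the name "end_training" comes from the script. The first function reuses the stop flag to hold a timestamp; the second shows one possible fix, assuming a separate variable for the timestamp is acceptable.

```python
import time


def train_buggy(data_files, steps_per_file=2):
    """Hypothetical loop in which one name serves as both stop flag and timestamp."""
    end_training = False  # intended meaning: boolean stop flag
    files_used = 0
    for _ in data_files:
        for _ in range(steps_per_file):
            start = time.time()
            # ... one training step would run here ...
            end_training = time.time()  # BUG: the flag now holds a float timestamp
            _step_time = end_training - start
        files_used += 1
        if end_training:  # a non-zero float is truthy, so this always fires
            break
    return files_used


def train_fixed(data_files, steps_per_file=2, max_steps=None):
    """Same hypothetical loop, with the timestamp kept in its own variable."""
    end_training = False
    files_used = 0
    global_step = 0
    for _ in data_files:
        for _ in range(steps_per_file):
            step_start = time.time()
            # ... one training step would run here ...
            step_end = time.time()  # separate name, the flag is left untouched
            _step_time = step_end - step_start
            global_step += 1
            if max_steps is not None and global_step >= max_steps:
                end_training = True  # only a real stop condition sets the flag
                break
        files_used += 1
        if end_training:
            break
    return files_used
```

With three data files, `train_buggy` stops after the first file, while `train_fixed` processes all three unless the assumed `max_steps` limit is reached first.
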
sramakintel commented 7 months ago

@taotod could you submit a PR with the workaround, if possible?