Open zhixingheyi-tian opened 3 years ago
The BERT-Large SQuAD training log will have values like `INFO:tensorflow:examples/sec: ...`. This number can be multiplied by the number of MPI processes (in your example that's 2, since you have `--mpi_num_processes=2`) to get the total examples per second.
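As a minimal sketch of that calculation, the snippet below pulls the `examples/sec` values out of a log, averages them, and scales by the MPI process count. The regex, the averaging choice, and the sample log text are assumptions for illustration, not taken from the actual benchmark scripts.

```python
import re

def total_examples_per_sec(log_text: str, num_mpi_processes: int) -> float:
    """Average the per-process examples/sec values found in a TensorFlow
    training log and scale by the MPI process count (a rough aggregate,
    assuming each rank logs comparable rates)."""
    rates = [float(m) for m in re.findall(r"examples/sec: ([\d.]+)", log_text)]
    if not rates:
        raise ValueError("no examples/sec entries found in log")
    return (sum(rates) / len(rates)) * num_mpi_processes

# Hypothetical log excerpt for demonstration:
log = "INFO:tensorflow:examples/sec: 10.5\nINFO:tensorflow:examples/sec: 11.5\n"
print(total_examples_per_sec(log, 2))  # 22.0
```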
@zhixingheyi-tian, can you try our latest optimizations for TensorFlow BERT-Large by referring to the link here: https://www.intel.com/content/www/us/en/developer/articles/containers/cpu-reference-model-containers.html?
I ran into some confusion when following the guide (https://github.com/IntelAI/models/tree/master/benchmarks/language_modeling/tensorflow/bert_large) to run the training workload.
Running command:
Result:
I didn't see a throughput metric like `(num_processed_examples - threshold_examples) / elapsed_time` in the training log the way I do for the inference workload. I also read the training script, models/models/language_modeling/tensorflow/bert_large/training/fp32/run_squad.py, and found nothing about throughput. The inference script, ./models/models/language_modeling/tensorflow/bert_large/inference/run_squad.py, does contain code that computes throughput as `(num_processed_examples - threshold_examples) / elapsed_time`.
So how should I evaluate the performance of BERT-Large training, given that neither "throughput" nor "elapsed time" appears in the log or the training script?
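For reference, the inference-side metric described above amounts to timing everything after a warm-up window. This is a paraphrased sketch of that formula, not code from the actual run_squad.py; the function and parameter names are made up for illustration.

```python
import time

def measure_throughput(process_example, examples, threshold_examples=10):
    """Process all examples, but start the clock only after a warm-up of
    `threshold_examples`, mirroring the
    (num_processed_examples - threshold_examples) / elapsed_time formula."""
    start = None
    for i, example in enumerate(examples):
        if i == threshold_examples:
            start = time.perf_counter()  # exclude warm-up iterations from timing
        process_example(example)
    elapsed = time.perf_counter() - start
    return (len(examples) - threshold_examples) / elapsed
```

A training script without such instrumentation can still be evaluated from its logged `examples/sec` values, as noted in the reply above.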
@ashahba @dmsuehir
Thanks