Can not run BERT-Large training successfully on bare metal

zhixingheyi-tian commented 3 years ago

We are running BERT-Large training on a bare-metal Ubuntu server. The log shows no errors, but it also contains no training output, which is confusing.

Command:

python ./launch_benchmark.py \
    --model-name=bert_large \
    --precision=fp32 \
    --mode=training \
    --framework=tensorflow \
    --batch-size=24 \
    --benchmark-only \
    --data-location=$BERT_LARGE_DIR \
    --num-inter-threads=1 \
    -- train-option=SQuAD \
       DEBIAN_FRONTEND=noninteractive \
       config_file=$BERT_LARGE_DIR/bert_config.json \
       init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt \
       vocab_file=$BERT_LARGE_DIR/vocab.txt \
       train_file=$SQUAD_DIR/train-v1.1.json \
       predict_file=$SQUAD_DIR/dev-v1.1.json \
       do-train=True \
       learning-rate=1.5e-5 \
       max-seq-length=384 \
       do_predict=True \
       warmup-steps=0 \
       num_train_epochs=0.1 \
       doc_stride=128 \
       do_lower_case=False \
       experimental-gelu=False \
       mpi_workers_sync_gradients=True

The log:

INFO:tensorflow:Graph was finalized.
I0625 09:40:30.595448 140247941625664 monitored_session.py:246] Graph was finalized.
2021-06-25 09:40:30.595915: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-25 09:40:30.764862: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2892875000 Hz
2021-06-25 09:40:30.767997: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c703127e80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-25 09:40:30.768068: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Running local_init_op.
I0625 09:40:50.980941 140247941625664 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0625 09:40:51.142987 140247941625664 session_manager.py:508] Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
I0625 09:41:02.433922 140247941625664 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /home/shen/models/benchmarks/common/tensorflow/logs/model.ckpt.
I0625 09:41:02.434337 140247941625664 basic_session_run_hooks.py:618] Saving checkpoints for 0 into /home/shen/models/benchmarks/common/tensorflow/logs/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
I0625 09:41:08.454857 140247941625664 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 0...
INFO:Running SQuAD...!
----------------------------Run command-------------------------------------

So there are no training results in the log.

@dmsuehir @ashahba, would you please help troubleshoot this?

Thanks

zhixingheyi-tian commented 3 years ago

Hi @dmsuehir, do you have any ideas about this issue?

Thanks

dmsuehir commented 3 years ago

@zhixingheyi-tian Is this the same issue that's being discussed in the email thread with Wei? It sounded like the next steps were to make sure that you are pip installing intel-tensorflow instead of just tensorflow.
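
One way to check which TensorFlow distribution pip installed is sketched below. It assumes Python 3.8 or later (for `importlib.metadata`) and that it is run with the same interpreter used for `launch_benchmark.py`. The helper name `installed_tensorflow_packages` is hypothetical and not part of the model zoo.

```python
from importlib import metadata

# Distribution names to look for. Both names come from the comment above;
# any other name added here would be an assumption.
CANDIDATES = ("intel-tensorflow", "tensorflow")


def installed_tensorflow_packages(candidates=CANDIDATES):
    """Return {distribution name: version} for each candidate that is installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the current environment.
            continue
    return found


if __name__ == "__main__":
    packages = installed_tensorflow_packages()
    if not packages:
        print("No TensorFlow distribution found in this environment.")
    for name, version in packages.items():
        print(f"{name}=={version}")
    if "tensorflow" in packages and "intel-tensorflow" not in packages:
        print("Only stock tensorflow is installed; intel-tensorflow is missing.")
```

If both distributions are listed, they may be shadowing each other in the same environment, so it is probably worth uninstalling both and reinstalling only intel-tensorflow before rerunning the benchmark.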

sramakintel commented 7 months ago

@zhixingheyi-tian: can you confirm whether the issue is resolved? If not, can you try our latest optimizations for BERT-Large training here: https://github.com/IntelAI/models/tree/r3.1/quickstart/language_modeling/tensorflow/bert_large/training/cpu

intel/ai-reference-models, issue #86