mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Bert pretrain script message "Could not find trained model in model_dir: /tmp/output/" #642

Closed: mahmoodn closed this issue 4 months ago

mahmoodn commented 1 year ago

Hi, when I run the final BERT command:

$ TF_XLA_FLAGS='--tf_xla_auto_jit=2' python3 run_pretraining.py \
--bert_config_file=./cleanup_scripts/wiki/bert_config.json \
--output_dir=/tmp/output/ \
--input_file=./cleanup_scripts/tfrecord/eval_10k \
--do_eval --nodo_train --eval_batch_size=8 \
--init_checkpoint=./cleanup_scripts/wiki/tf2_ckpt/model.ckpt-28252.index \
--iterations_per_loop=1000 --learning_rate=0.0001 --max_eval_steps=1250 \
--max_predictions_per_seq=76 --max_seq_length=512 --num_gpus=1 \
--num_train_steps=107538 --num_warmup_steps=1562 --optimizer=lamb \
--save_checkpoints_steps=1562 --start_warmup_step=0 --train_batch_size=24 --nouse_tpu

I see the following message in the output:

INFO:tensorflow:Could not find trained model in model_dir: /tmp/output/, running initialization to evaluate.
I0504 12:21:51.284748 140593756890944 estimator.py:496] Could not find trained model in model_dir: /tmp/output/, running initialization to evaluate.

This message appears at the end of each evaluation, like this:

INFO:tensorflow:Evaluation [125/1250]
I0504 12:19:14.677203 140593756890944 evaluation.py:163] Evaluation [125/1250]
INFO:tensorflow:Evaluation [250/1250]
I0504 12:19:31.882551 140593756890944 evaluation.py:163] Evaluation [250/1250]
INFO:tensorflow:Evaluation [375/1250]
I0504 12:19:49.123802 140593756890944 evaluation.py:163] Evaluation [375/1250]
INFO:tensorflow:Evaluation [500/1250]
I0504 12:20:06.435338 140593756890944 evaluation.py:163] Evaluation [500/1250]
INFO:tensorflow:Evaluation [625/1250]
I0504 12:20:23.763262 140593756890944 evaluation.py:163] Evaluation [625/1250]
INFO:tensorflow:Evaluation [750/1250]
I0504 12:20:41.139376 140593756890944 evaluation.py:163] Evaluation [750/1250]
INFO:tensorflow:Evaluation [875/1250]
I0504 12:20:58.551066 140593756890944 evaluation.py:163] Evaluation [875/1250]
INFO:tensorflow:Evaluation [1000/1250]
I0504 12:21:16.016685 140593756890944 evaluation.py:163] Evaluation [1000/1250]
INFO:tensorflow:Evaluation [1125/1250]
I0504 12:21:33.469490 140593756890944 evaluation.py:163] Evaluation [1125/1250]
INFO:tensorflow:Evaluation [1250/1250]
I0504 12:21:50.927570 140593756890944 evaluation.py:163] Evaluation [1250/1250]
INFO:tensorflow:Inference Time : 185.22034s
I0504 12:21:51.132023 140593756890944 evaluation.py:269] Inference Time : 185.22034s
INFO:tensorflow:Finished evaluation at 2023-05-04-12:21:51
I0504 12:21:51.132229 140593756890944 evaluation.py:271] Finished evaluation at 2023-05-04-12:21:51
INFO:tensorflow:Saving dict for global step 0: global_step = 0, loss = 11.218545, masked_lm_accuracy = 7.01318e-06, masked_lm_loss = 10.521475, next_sentence_accuracy = 0.44830003, next_sentence_loss = 0.6969624
I0504 12:21:51.132311 140593756890944 estimator.py:2083] Saving dict for global step 0: global_step = 0, loss = 11.218545, masked_lm_accuracy = 7.01318e-06, masked_lm_loss = 10.521475, next_sentence_accuracy = 0.44830003, next_sentence_loss = 0.6969624
:::MLLOG {"namespace": "", "time_ms": 1683195711283, "event_type": "INTERVAL_END", "key": "eval_stop", "value": 0, "metadata": {"file": "run_pretraining.py", "lineno": 629, "step_num": 0}}
:::MLLOG {"namespace": "", "time_ms": 1683195711284, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 7.013180038484279e-06, "metadata": {"file": "run_pretraining.py", "lineno": 631, "step_num": 0}}
INFO:tensorflow:***** Eval results *****
I0504 12:21:51.284175 140593756890944 run_pretraining.py:637] ***** Eval results *****
INFO:tensorflow:  global_step = 0
I0504 12:21:51.284245 140593756890944 run_pretraining.py:639]   global_step = 0
INFO:tensorflow:  loss = 11.218545
I0504 12:21:51.284356 140593756890944 run_pretraining.py:639]   loss = 11.218545
INFO:tensorflow:  masked_lm_accuracy = 7.01318e-06
I0504 12:21:51.284395 140593756890944 run_pretraining.py:639]   masked_lm_accuracy = 7.01318e-06
INFO:tensorflow:  masked_lm_loss = 10.521475
I0504 12:21:51.284429 140593756890944 run_pretraining.py:639]   masked_lm_loss = 10.521475
INFO:tensorflow:  next_sentence_accuracy = 0.44830003
I0504 12:21:51.284461 140593756890944 run_pretraining.py:639]   next_sentence_accuracy = 0.44830003
INFO:tensorflow:  next_sentence_loss = 0.6969624
I0504 12:21:51.284492 140593756890944 run_pretraining.py:639]   next_sentence_loss = 0.6969624
:::MLLOG {"namespace": "", "time_ms": 1683195711284, "event_type": "INTERVAL_START", "key": "eval_start", "value": null, "metadata": {"file": "run_pretraining.py", "lineno": 619}}
INFO:tensorflow:Could not find trained model in model_dir: /tmp/output/, running initialization to evaluate.
I0504 12:21:51.284748 140593756890944 estimator.py:496] Could not find trained model in model_dir: /tmp/output/, running initialization to evaluate.
INFO:tensorflow:Calling model_fn.
I0504 12:21:51.303224 140593756890944 estimator.py:1173] Calling model_fn.
.....
.....
I0504 12:24:38.859874 140593756890944 evaluation.py:163] Evaluation [1125/1250]
INFO:tensorflow:Evaluation [1250/1250]
I0504 12:24:56.348826 140593756890944 evaluation.py:163] Evaluation [1250/1250]
INFO:tensorflow:Inference Time : 183.66613s
I0504 12:24:56.513465 140593756890944 evaluation.py:269] Inference Time : 183.66613s
INFO:tensorflow:Finished evaluation at 2023-05-04-12:24:56
I0504 12:24:56.513627 140593756890944 evaluation.py:271] Finished evaluation at 2023-05-04-12:24:56
INFO:tensorflow:Saving dict for global step 0: global_step = 0, loss = 11.219577, masked_lm_accuracy = 7.0131806e-05, masked_lm_loss = 10.469661, next_sentence_accuracy = 0.43410003, next_sentence_loss = 0.7496157
I0504 12:24:56.513709 140593756890944 estimator.py:2083] Saving dict for global step 0: global_step = 0, loss = 11.219577, masked_lm_accuracy = 7.0131806e-05, masked_lm_loss = 10.469661, next_sentence_accuracy = 0.43410003, next_sentence_loss = 0.7496157
:::MLLOG {"namespace": "", "time_ms": 1683195896514, "event_type": "INTERVAL_END", "key": "eval_stop", "value": 0, "metadata": {"file": "run_pretraining.py", "lineno": 629, "step_num": 0}}
:::MLLOG {"namespace": "", "time_ms": 1683195896514, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 7.0131805841811e-05, "metadata": {"file": "run_pretraining.py", "lineno": 631, "step_num": 0}}
INFO:tensorflow:***** Eval results *****
I0504 12:24:56.514305 140593756890944 run_pretraining.py:637] ***** Eval results *****
INFO:tensorflow:  global_step = 0
I0504 12:24:56.514346 140593756890944 run_pretraining.py:639]   global_step = 0
INFO:tensorflow:  loss = 11.219577
I0504 12:24:56.514452 140593756890944 run_pretraining.py:639]   loss = 11.219577
INFO:tensorflow:  masked_lm_accuracy = 7.0131806e-05
I0504 12:24:56.514493 140593756890944 run_pretraining.py:639]   masked_lm_accuracy = 7.0131806e-05
INFO:tensorflow:  masked_lm_loss = 10.469661
I0504 12:24:56.514528 140593756890944 run_pretraining.py:639]   masked_lm_loss = 10.469661
INFO:tensorflow:  next_sentence_accuracy = 0.43410003
I0504 12:24:56.514561 140593756890944 run_pretraining.py:639]   next_sentence_accuracy = 0.43410003
INFO:tensorflow:  next_sentence_loss = 0.7496157
I0504 12:24:56.514594 140593756890944 run_pretraining.py:639]   next_sentence_loss = 0.7496157
:::MLLOG {"namespace": "", "time_ms": 1683195896514, "event_type": "INTERVAL_START", "key": "eval_start", "value": null, "metadata": {"file": "run_pretraining.py", "lineno": 619}}
INFO:tensorflow:Could not find trained model in model_dir: /tmp/output/, running initialization to evaluate.
I0504 12:24:56.514858 140593756890944 estimator.py:496] Could not find trained model in model_dir: /tmp/output/, running initialization to evaluate.
INFO:tensorflow:Calling model_fn.
I0504 12:24:56.534088 140593756890944 estimator.py:1173] Calling model_fn.

Is that normal? Should I ignore it?

Daming-wang commented 1 year ago

Did you run the pre-training command? Before running the evaluation, you need to run pre-training so that it writes its output to --output_dir=/tmp/output. Then you can run the evaluation command with the same --output_dir as the pre-training command. If --output_dir is correct, the evaluation will run just once; otherwise it will loop endlessly.
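For example, a quick way to check whether /tmp/output/ already holds a trained model (a hedged sketch, assuming the standard TensorFlow checkpoint layout that a --do_train run writes out):

$ ls /tmp/output/
# a usable model_dir should contain a "checkpoint" file plus
# model.ckpt-*.index / .data-* / .meta files from the training run
$ cat /tmp/output/checkpoint
# lists the checkpoint path(s) that the evaluation run will resolve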

mahmoodn commented 1 year ago

I ran create_pretraining_data.py, and the output file is ./eval_intermediate.

$ ls cleanup_scripts/eval_intermediate -lh
-rw-rw-r-- 1 mahmood mahmood 806M May  3 17:36 cleanup_scripts/eval_intermediate

The exact sequence of commands is:

python3 create_pretraining_data.py  \
    --input_file=./results/eval.txt \
    --output_file=./eval_intermediate \
    --vocab_file=./wiki/vocab.txt \
    --do_lower_case=True --max_seq_length=512 --max_predictions_per_seq=76 \
    --masked_lm_prob=0.15 --random_seed=12345 --dupe_factor=10

TF_XLA_FLAGS='--tf_xla_auto_jit=2' python3 run_pretraining.py      \
    --bert_config_file=./cleanup_scripts/wiki/bert_config.json      \
    --output_dir=/tmp/output/      \
    --input_file=./cleanup_scripts/tfrecord/eval_10k       \
    --do_eval --nodo_train --eval_batch_size=8      \
    --init_checkpoint=./cleanup_scripts/wiki/tf2_ckpt/model.ckpt-28252.index      \
    --iterations_per_loop=1000 --learning_rate=0.0001 --max_eval_steps=1250      \
    --max_predictions_per_seq=76 --max_seq_length=512 --num_gpus=1     \
    --num_train_steps=107538 --num_warmup_steps=1562 --optimizer=lamb      \
    --save_checkpoints_steps=1562 --start_warmup_step=0 --train_batch_size=24 --nouse_tpu

What confuses me is that eval_intermediate is not used anywhere in the run_pretraining.py command.

Daming-wang commented 1 year ago

Apologies, I may not have been clear enough. What I mean is: after you have prepared the datasets, you need to run the pre-training process to get the output (here, "output" refers to the model, not the datasets). The pre-training command looks like this:

TF_XLA_FLAGS='--tf_xla_auto_jit=2' \
python run_pretraining.py \
    --bert_config_file=<path to bert_config.json> \
    --output_dir=/tmp/output/ \
    --input_file="<tfrecord dir>/part*" \
    --nodo_eval \
    --do_train \
    --eval_batch_size=8 \
    --learning_rate=0.0001 \
    --init_checkpoint=./checkpoint/model.ckpt-28252 \
    --iterations_per_loop=1000 \
    --max_predictions_per_seq=76 \
    --max_seq_length=512 \
    --num_train_steps=107538 \
    --num_warmup_steps=1562 \
    --optimizer=lamb \
    --save_checkpoints_steps=6250 \
    --start_warmup_step=0 \
    --num_gpus=8 \
    --train_batch_size=24

You need to run pre-training before running evaluation to obtain the output required for evaluation.
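As a hedged sketch of why the message appears (based on how the TF Estimator resolves checkpoints): evaluation looks up the newest checkpoint in model_dir via tf.train.latest_checkpoint and falls back to fresh initialization when none is found. You can query this directly:

$ python3 -c "import tensorflow as tf; print(tf.train.latest_checkpoint('/tmp/output/'))"
# prints None until a --do_train run has written checkpoints to /tmp/output/,
# which is exactly the case where "Could not find trained model" is logged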

mahmoodn commented 1 year ago

Something is confusing here. As I mentioned in the first post, when I run run_pretraining.py I see the message "Could not find trained model in model_dir: /tmp/output/". So the pretraining script cannot find the trained model. What is missing here?

Daming-wang commented 1 year ago

I run the training command like this:

 TF_XLA_FLAGS='--tf_xla_auto_jit=2' \ 
python run_pretraining.py \ 
--bert_config_file=<path to bert_config.json> \ 
--output_dir=/tmp/output/ \ 
--input_file="<tfrecord dir>/part*" \ 
--nodo_eval \ 
--do_train \ 
--eval_batch_size=8 \ 
--learning_rate=0.0001 \ 
--init_checkpoint=./checkpoint/model.ckpt-28252 \ 
--iterations_per_loop=1000 \ 
--max_predictions_per_seq=76 \ 
--max_seq_length=512 \ 
--num_train_steps=107538 \ 
--num_warmup_steps=1562 \ 
--optimizer=lamb \ 
--save_checkpoints_steps=6250 \ 
--start_warmup_step=0 \ 
--num_gpus=8 \ 
--train_batch_size=24

After training, I obtain some files in /tmp/output, like this:

ls /tmp/output/

checkpoint                                   model.ckpt-0.data-00000-of-00001
eval                                         model.ckpt-0.index
events.out.tfevents.1686206301.d74a203062ca  model.ckpt-0.meta
graph.pbtxt
ShriyaPalsamudram commented 4 months ago

@mahmoodn is this still an issue?

mahmoodn commented 4 months ago

No, I forgot to close it.