Did you run the pre-training command? Before running the evaluation, you should run the pre-training command so that it writes its output to --output_dir=/tmp/output. Then you can run the evaluation command with the same --output_dir as the pre-training command. If --output_dir is correct, the evaluation will run just once; otherwise it will loop endlessly.
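For clarity, the intended order is roughly the following (a minimal sketch with placeholder paths; OUTPUT_DIR and the elided flags are illustrative, not the exact reference settings):
# 1) Pre-train: writes checkpoints into $OUTPUT_DIR
OUTPUT_DIR=/tmp/output
python3 run_pretraining.py --do_train --nodo_eval \
--output_dir=$OUTPUT_DIR --input_file="<tfrecord dir>/part*" ...
# 2) Evaluate: point --output_dir at the SAME directory so the script can find the checkpoints from step 1
python3 run_pretraining.py --do_eval --nodo_train \
--output_dir=$OUTPUT_DIR --input_file=<eval tfrecords> ...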
I ran create_pretraining_data.py and the output file is ./eval_intermediate.
$ ls cleanup_scripts/eval_intermediate -lh
-rw-rw-r-- 1 mahmood mahmood 806M May 3 17:36 cleanup_scripts/eval_intermediate
The exact sequence of commands is:
python3 create_pretraining_data.py \
--input_file=./results/eval.txt \
--output_file=./eval_intermediate \
--vocab_file=./wiki/vocab.txt \
--do_lower_case=True --max_seq_length=512 --max_predictions_per_seq=76 \
--masked_lm_prob=0.15 --random_seed=12345 --dupe_factor=10
TF_XLA_FLAGS='--tf_xla_auto_jit=2' python3 run_pretraining.py \
--bert_config_file=./cleanup_scripts/wiki/bert_config.json \
--output_dir=/tmp/output/ \
--input_file=./cleanup_scripts/tfrecord/eval_10k \
--do_eval --nodo_train --eval_batch_size=8 \
--init_checkpoint=./cleanup_scripts/wiki/tf2_ckpt/model.ckpt-28252.index \
--iterations_per_loop=1000 --learning_rate=0.0001 --max_eval_steps=1250 \
--max_predictions_per_seq=76 --max_seq_length=512 --num_gpus=1 \
--num_train_steps=107538 --num_warmup_steps=1562 --optimizer=lamb \
--save_checkpoints_steps=1562 --start_warmup_step=0 --train_batch_size=24 --nouse_tpu
I am confused that eval_intermediate is not used in the run_pretraining.py command.
Apologies, I may not have been clear enough. What I mean is: after you have finished preparing the datasets, you need to run the pre-training process to get the output (here, "output" refers to the model, not the datasets).
The pre-training command looks like this:
TF_XLA_FLAGS='--tf_xla_auto_jit=2' \
python run_pretraining.py \
--bert_config_file=<path to bert_config.json> \
--output_dir=/tmp/output/ \
--input_file="<tfrecord dir>/part*" \
--nodo_eval \
--do_train \
--eval_batch_size=8 \
--learning_rate=0.0001 \
--init_checkpoint=./checkpoint/model.ckpt-28252 \
--iterations_per_loop=1000 \
--max_predictions_per_seq=76 \
--max_seq_length=512 \
--num_train_steps=107538 \
--num_warmup_steps=1562 \
--optimizer=lamb \
--save_checkpoints_steps=6250 \
--start_warmup_step=0 \
--num_gpus=8 \
--train_batch_size=24
You need to run pre-training before running evaluation to obtain the output required for evaluation.
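A quick sanity check before starting evaluation (using the paths from the commands above) is to confirm that the output directory actually contains checkpoint files:
ls /tmp/output/
# expect a `checkpoint` index file plus model.ckpt-*.index / .data-* / .meta files;
# if these are missing, run_pretraining.py has no trained model to evaluate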
Something is confusing here. If you refer to the first post, I mentioned that when I run run_pretraining.py, I see the message Could not find trained model in model_dir: /tmp/output/. So run_pretraining.py cannot find the trained model. What is missing here?
I run the training command like this:
TF_XLA_FLAGS='--tf_xla_auto_jit=2' \
python run_pretraining.py \
--bert_config_file=<path to bert_config.json> \
--output_dir=/tmp/output/ \
--input_file="<tfrecord dir>/part*" \
--nodo_eval \
--do_train \
--eval_batch_size=8 \
--learning_rate=0.0001 \
--init_checkpoint=./checkpoint/model.ckpt-28252 \
--iterations_per_loop=1000 \
--max_predictions_per_seq=76 \
--max_seq_length=512 \
--num_train_steps=107538 \
--num_warmup_steps=1562 \
--optimizer=lamb \
--save_checkpoints_steps=6250 \
--start_warmup_step=0 \
--num_gpus=8 \
--train_batch_size=24
After training, I obtain some files in /tmp/output, like this:
ls /tmp/output/
checkpoint
eval
events.out.tfevents.1686206301.d74a203062ca
graph.pbtxt
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
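For reference, the checkpoint file in that directory records which checkpoint TensorFlow will treat as the latest one when evaluation starts; a quick way to inspect it (paths taken from the listing above, exact contents may differ):
cat /tmp/output/checkpoint
# expected to contain something like:
#   model_checkpoint_path: "model.ckpt-0"
# or ask TensorFlow directly which checkpoint it would load:
python3 -c "import tensorflow as tf; print(tf.train.latest_checkpoint('/tmp/output/'))"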
@mahmoodn is this still an issue?
No, I forgot to close it.
Hi, when I run the final BERT command:
I see the following message in the output:
This message appears at the end of each evaluation, like this:
Is that normal? Should I ignore it?