google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

how to get embeddings after running run_pretraining.py code #521

Open KavyaGujjala opened 5 years ago

KavyaGujjala commented 5 years ago

Hi, I have run the run_pretraining.py script on my domain-specific data.

It seems that only checkpoints are saved; I got two files, 0000020.params and 0000020.states.

How can I save the model, or build one from the .params and .states files in the checkpoint folder, so that I can use it to get contextual embeddings?

Can someone please help me with this?
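If the pretraining produced a standard TensorFlow checkpoint (the model.ckpt-* files this repo's run_pretraining.py writes), one way to get contextual embeddings is the repo's extract_features.py script. A minimal sketch, assuming a plain-text input file with one sentence per line; the input/output paths and checkpoint step are placeholders:

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=/tmp/pretraining_output/model.ckpt-20 \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=8

This writes one JSON line per input sentence with per-token activations from the requested layers. Note that .params/.states files look like output from an MXNet/GluonNLP-style pretraining script rather than from this repo, and this TensorFlow script cannot read those directly.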

stevenwernercs commented 5 years ago

Same issue running run_pretraining.py:

I did not see the eval stats that the docs show; below is the tail of my output.

python run_pretraining.py \
  --input_file=/tmp/tf_examples.tfrecord \
  --output_dir=/tmp/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
...

INFO:tensorflow:  name = cls/seq_relationship/output_weights:0, shape = (2, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = cls/seq_relationship/output_bias:0, shape = (2,), *INIT_FROM_CKPT*
WARNING:tensorflow:From C:\Users\1133884\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\training\learning_rate_decay_v2.py:321: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2019-06-02 17:32:10.657478: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/pretraining_output\model.ckpt.
$ ls /tmp/pretraining_output
checkpoint  events.out.tfevents.1559492173.$USER
events.out.tfevents.1559493128.$USER  
graph.pbtxt  
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-0_temp_d429a2381051436897cfe1ecc2c55df0
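For what it's worth, the three model.ckpt-0.* files together form one TF checkpoint whose prefix is model.ckpt-0 (the *_temp_* entry is likely debris from an interrupted save). A quick way to check which checkpoint TensorFlow considers current and what it contains, as a sketch against TF 1.x (which this repo targets), with the output_dir path carried over from above:

import tensorflow as tf  # TF 1.x

# Resolve the newest checkpoint prefix recorded in the 'checkpoint' file.
ckpt = tf.train.latest_checkpoint("/tmp/pretraining_output")
print(ckpt)  # e.g. /tmp/pretraining_output/model.ckpt-0

# List every variable stored in the checkpoint along with its shape.
reader = tf.train.NewCheckpointReader(ckpt)
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)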

Should we simply pass the flags below when running the model?

  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$MY_PRETRAINED_BERT_DIR/bert_model.ckpt \

EDIT...

I reduced --num_train_steps from 20 to 2, since I didn't see the eval output at the end; apparently my Python kernel had crashed silently.

$ ls /tmp/pretraining_output
checkpoint
eval
eval_results.txt
events.out.tfevents.1559497862.RICL-AI710741
graph.pbtxt
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-2.data-00000-of-00001
model.ckpt-2.index
model.ckpt-2.meta

I still don't have a vocab or config; should we simply pass the flags below when running the model?

  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$MY_PRETRAINED_BERT_DIR/model.ckpt \

KavyaGujjala commented 5 years ago

Hi @stevenwernercs, to use the model you have pretrained, include the checkpoint number as well.

For example, if this is your folder:

$ ls /tmp/pretraining_output
checkpoint
eval
eval_results.txt
events.out.tfevents.1559497862.RICL-AI710741
graph.pbtxt
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-2.data-00000-of-00001
model.ckpt-2.index
model.ckpt-2.meta

you can specify which checkpoint to load, like so:

  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$MY_PRETRAINED_BERT_DIR/model.ckpt-2 \

vocab.txt and bert_config.json come from the original pretrained BERT model itself, as you mentioned.
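Putting it together, a complete downstream invocation against your own checkpoint could look like the sketch below, modeled on the MRPC fine-tuning example from this repo's README; $GLUE_DIR and the output path are placeholders:

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$MY_PRETRAINED_BERT_DIR/model.ckpt-2 \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/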

Hope I have cleared your doubt.

AjitAntony commented 4 years ago

@KavyaGujjala, did you find out how to finally save the model? I'm facing a similar issue.

Hi, I followed the instructions shared in Pre-training with BERT.

I ran these two scripts as mentioned in the instructions: python create_pretraining_data.py and then python run_pretraining.py.

After running those two scripts, I got the list of files below in the folder output_dir=/tmp/pretraining_output:

checkpoint
eval
eval_results.txt
events.out.tfevents.1575545953.26760d2fc979
graph.pbtxt
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-20.data-00000-of-00001
model.ckpt-20.index
model.ckpt-20.meta

What should I do now to test the prediction of masked words for the sentences in the sample_text.txt that was used as the corpus to pretrain this masked LM?

For example: I want to input the sentence "man walked into store to purchase a gallon of [MASK]" into the model that was pretrained (which is in /tmp/pretraining_output).

I know how to load the original BertForMaskedLM model to do masked-word prediction, but how do I do it for the pretrained model I got in output_dir=/tmp/pretraining_output?

When I tried to load the model from this output instead of using BertForMaskedLM.from_pretrained('bert-base-uncased'), I got the error below:

from transformers import BertForMaskedLM
BertNSP = BertForMaskedLM.from_pretrained('/content/drive/My Drive/bert_training/pretraining_output/')

On loading, I get: Model name '/content/drive/My Drive/bert_training/pretraining_output/' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased).

We assumed '/content/drive/My Drive/bert_training/pretraining_output/config.json' was a path or url to a configuration file named config.json or a directory containing such a file but couldn't find any such file at this path or url.
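The error means transformers expects a directory containing a config.json and a PyTorch weights file, not a raw TF checkpoint. One approach, sketched below, is to convert the TF checkpoint first and then copy the config and vocab alongside it; the exact converter module path may vary with your transformers version:

# Convert the TF checkpoint to a PyTorch weights file.
python -m transformers.convert_bert_original_tf_checkpoint_to_pytorch \
  --tf_checkpoint_path /tmp/pretraining_output/model.ckpt-20 \
  --bert_config_file $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_path /tmp/pretraining_output/pytorch_model.bin

# transformers looks for config.json (and vocab.txt for the tokenizer) next to the weights.
cp $BERT_BASE_DIR/bert_config.json /tmp/pretraining_output/config.json
cp $BERT_BASE_DIR/vocab.txt /tmp/pretraining_output/vocab.txt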

Can you please tell me how I should proceed after getting these files? How can I save it as a model and do masked-word prediction?
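Once the directory holds config.json, pytorch_model.bin, and vocab.txt (see the conversion sketch above), masked-word prediction could look like the sketch below; the directory path is an assumption carried over from the conversion step:

import torch
from transformers import BertForMaskedLM, BertTokenizer

model_dir = '/tmp/pretraining_output'  # assumed converted directory

tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForMaskedLM.from_pretrained(model_dir)
model.eval()

text = "man walked into store to purchase a gallon of [MASK]"
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])

with torch.no_grad():
    logits = model(input_ids)[0]  # (batch, seq_len, vocab_size)

# Locate the [MASK] token and take the highest-scoring prediction there.
mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))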
