google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

How to use the pretrained checkpoint to continue training on my own corpus? #888

Open RyanHuangNLP opened 4 years ago

RyanHuangNLP commented 4 years ago

I want to load the pretrained checkpoint and continue training on my own corpus. I use the run_pretraining.py code and set init_checkpoint to the pretrained directory, but when I run the code it raises an error:

ERROR:tensorflow:Error recorded from training_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

From /job:worker/replica:0/task:0:
Key bert/embeddings/LayerNorm/beta/adam_m not found in checkpoint
     [[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

I know that once training is finished it is better to remove the adam_m and adam_v parameters to reduce the size of the checkpoint file, but since I want to continue training from the pretrained checkpoint, how do I solve this problem? Maybe I can recover the Adam-related variable names in the checkpoint file? Thank you.
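The released BERT checkpoints ship only the model weights, so a quick first step is to list what the checkpoint actually contains. Below is a minimal sketch, assuming TensorFlow 1.x and a checkpoint prefix of your own (the path shown is an assumption, not from this thread):

```python
import tensorflow as tf  # written against TF 1.x

# Assumed checkpoint prefix; point this at your own pretrained directory.
ckpt = "uncased_L-12_H-768_A-12/bert_model.ckpt"

# Print every variable name and shape stored in the checkpoint.
for name, shape in tf.train.list_variables(ckpt):
    print(name, shape)
```

If the listing shows only bert/... and cls/... weights, with no .../adam_m, .../adam_v, or global_step entries, a full restore of the training graph will fail exactly as in the error above. One commonly suggested workaround (not confirmed in this thread) is to pass the released checkpoint via --init_checkpoint and point --output_dir at a fresh, empty directory, so run_pretraining.py creates new optimizer slots instead of trying to restore them.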

ibrahimishag commented 4 years ago

I ran into a similar issue while trying to load a model trained under TensorFlow 1.x into the upgraded code under TensorFlow 2.0. If you have solved the issue, please share your approach.

RyanHuangNLP commented 4 years ago

@ibrahimishag The TensorFlow 2.0 variable names differ from the TensorFlow 1.x ones; you may refer to here.
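For readers hitting the TF 1.x vs TF 2.x name mismatch, a generic recipe is to copy the checkpoint while rewriting variable names. The sketch below is an illustration only: the paths and the rename rule are assumptions (adjust the rule to whatever systematic difference you actually observe), and it is not the official conversion script.

```python
import tensorflow as tf  # TF 1.x-style graph code

OLD_CKPT = "old_model/bert_model.ckpt"      # assumed input prefix
NEW_CKPT = "renamed_model/bert_model.ckpt"  # assumed output prefix

def rename(old_name):
    # Placeholder rule: map a TF 2.x-style layer-norm name back to the
    # TF 1.x-style name the BERT pre-training graph expects.
    return old_name.replace("layer_normalization", "LayerNorm")

# Read every tensor from the old checkpoint and re-create it under its new name.
reader = tf.train.load_checkpoint(OLD_CKPT)
new_vars = [
    tf.Variable(reader.get_tensor(name), name=rename(name))
    for name in reader.get_variable_to_shape_map()
]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.Saver(new_vars).save(sess, NEW_CKPT)
```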

manueltonneau commented 4 years ago

Hi all! I'm trying to initialize from the mBERT checkpoint, but it is missing the "bert/embeddings/LayerNorm/beta/adam_m" key in its list of variables (just like you described). I'm using TF 1.14 and have not found a solution by converting the checkpoint with TF 2.x. Did you find a solution?

manueltonneau commented 4 years ago

Hi @RyanHuangNLP, if you have found a solution for this problem, would you mind sharing it? :)

AakritiBudhraja commented 4 years ago

Hi, I am also facing the same issue:
While trying to train from the mBERT checkpoint: Key bert/embeddings/LayerNorm/beta/adam_m not found in checkpoint
While trying to predict from the mBERT checkpoint: Key global_step not found in checkpoint
@RyanHuangNLP Did you find a solution for these? Thanks in advance!
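Since the full restore expects both the Adam slots and a global_step, one possible (untested here) workaround is to rebuild the checkpoint with zero-initialised slots, which is what a freshly created optimizer would start from anyway. The paths and the bert/ and cls/ prefixes below are assumptions based on the released BERT checkpoints; this is a sketch, not a confirmed fix from this thread.

```python
import tensorflow as tf  # TF 1.x

OLD_CKPT = "multi_cased_L-12_H-768_A-12/bert_model.ckpt"  # assumed mBERT prefix
NEW_CKPT = "mbert_with_adam/bert_model.ckpt"              # assumed output prefix

reader = tf.train.load_checkpoint(OLD_CKPT)
new_vars = []
for name in reader.get_variable_to_shape_map():
    value = reader.get_tensor(name)
    new_vars.append(tf.Variable(value, name=name))
    if name.startswith("bert/") or name.startswith("cls/"):
        # One first- and one second-moment slot per weight, started at zero,
        # matching what a fresh optimizer would create for each trainable variable.
        new_vars.append(tf.Variable(tf.zeros_like(value), name=name + "/adam_m"))
        new_vars.append(tf.Variable(tf.zeros_like(value), name=name + "/adam_v"))

# Add a global_step so checkpoints restored for prediction also find it.
new_vars.append(tf.Variable(0, dtype=tf.int64, name="global_step"))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.Saver(new_vars).save(sess, NEW_CKPT)
```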

geo47 commented 3 years ago

Kindly share the solution if anyone knows it. Thanks

nikhildurgam95 commented 2 years ago

Did anyone figure out a solution for this? I'm facing the same problem. Kindly share if you have one.