tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

The variable is in the checkpoint, but the model cannot be loaded correctly #1486

Open chuanHN opened 5 years ago

chuanHN commented 5 years ago

Description

Hi, I want to reproduce the CNN translation model, but I am running into a model-loading problem. With TensorFlow 1.8 the model seems to load correctly, but with TensorFlow 1.12 it cannot be loaded. The error message is:

 NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
837.train | [2019-03-13T08:58:51Z] 
837.train | [2019-03-13T08:58:51Z] Key while/cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_0/conv1d/conv1d_7/kernel not found in checkpoint
837.train | [2019-03-13T08:58:51Z]       [[node save/RestoreV2_1 (defined at /code/tensor2tensor/tensor2tensor/utils/decoding.py:368)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]

But when I print the variables stored in the checkpoint, I find that the variable is actually there (a sketch for reproducing this listing follows the output):

('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_2/dense_8/kernel/Adam_1')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_5/dense_14/kernel/Adam_1')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_0/conv1d/conv1d_7/kernel/Adam')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_6/dense_16/kernel/Adam_1')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_3/dense_10/bias/Adam')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/encoder/dense/kernel/Adam')
('tensor_name: ', 'losses_avg/problem_0/extra_loss')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_5/conv1d/conv1d_12/kernel')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_3/conv1d/conv1d_10/kernel/Adam')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/encoder/cnn_1/conv1d/conv1d_1/kernel/Adam_1')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_3/dense_9/kernel/Adam')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_5/dense_14/bias')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_1/conv1d/conv1d_8/kernel/Adam')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_6/dense_16/bias/Adam_1')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_1/dense_5/kernel')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/cnn_0/conv1d/conv1d_7/kernel')
('tensor_name: ', 'cnn_translate/parallel_0_5/cnn_translate/cnn_translate/body/cnn_decoder/dense_17/bias/Adam_1')
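For reference, a listing like the one above can be produced with TensorFlow's checkpoint reader. A minimal sketch, assuming TensorFlow 1.x and a placeholder checkpoint directory:

    # Minimal sketch: list every variable stored in a checkpoint.
    # "/path/to/train_dir" is a placeholder, not the reporter's actual directory.
    import tensorflow as tf

    ckpt_path = tf.train.latest_checkpoint("/path/to/train_dir")
    reader = tf.train.NewCheckpointReader(ckpt_path)
    for name in sorted(reader.get_variable_to_shape_map()):
        print(("tensor_name: ", name))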

So I am very confused by this problem. Can someone help me?

chuanHN commented 5 years ago

I found the problem: 'while' gets prepended to the tensor names when decoding with tensor2tensor.
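For anyone hitting the same mismatch: one generic workaround (a sketch under the assumption that the only difference is an extra "while/" prefix, not necessarily the fix used here) is to restore through an explicit variable map that strips the prefix before building the Saver:

    # Sketch of a workaround, assuming a TF 1.x graph has already been built.
    # Map each checkpoint-side name (without "while/") to the in-graph variable.
    import tensorflow as tf

    ckpt_path = tf.train.latest_checkpoint("/path/to/train_dir")  # placeholder path

    var_map = {}
    for var in tf.global_variables():
        name = var.op.name
        # Graph-side names carry an extra "while/" prefix; checkpoint names do not.
        if name.startswith("while/"):
            name = name[len("while/"):]
        var_map[name] = var

    saver = tf.train.Saver(var_list=var_map)
    with tf.Session() as sess:
        saver.restore(sess, ckpt_path)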

tunadude09 commented 5 years ago

I'm experiencing the exact same error: "while" is prepended to an otherwise correct key name, which then causes NotFoundError ... Key while/... not found in checkpoint when restoring from a checkpoint with t2t-decoder.py. @chuanHN Did you find a solution to your problem? I can't find where or how "while" is being incorrectly added to the key name.

tunadude09 commented 5 years ago

Another occurrence of this bug seems to have been reported on Stack Overflow: https://stackoverflow.com/questions/56776076/cannot-restore-from-checkpoint-bidirectional-backward-lstm-bias/57398902#57398902

Do any devs have any ideas/suggestions as to what might be causing "while" to be prepended to key names during checkpoint restoration?
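One way to narrow down where the mismatch is introduced is to compare the variable names the decoding graph expects against the names actually stored in the checkpoint. A diagnostic sketch, with a placeholder checkpoint path, assuming it runs after the decoding graph has been built:

    # Diagnostic sketch: compare graph-side names against checkpoint-side names.
    # "/path/to/train_dir" is a placeholder checkpoint directory.
    import tensorflow as tf

    ckpt_path = tf.train.latest_checkpoint("/path/to/train_dir")

    graph_names = {v.op.name for v in tf.global_variables()}
    ckpt_names = {name for name, _ in tf.train.list_variables(ckpt_path)}

    print("in graph but not in checkpoint:")
    for name in sorted(graph_names - ckpt_names):
        print("  " + name)

    print("in checkpoint but not in graph:")
    for name in sorted(ckpt_names - graph_names):
        print("  " + name)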