tensorflow / models

Models and examples built with TensorFlow

There is an error in step 3 of maskgan #7040

Open bbkk5401 opened 5 years ago

bbkk5401 commented 5 years ago

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key critic/rnn/biases not found in checkpoint [[node save/RestoreV2 (defined at train_mask_gan.py:431) ]]

I didn't modify any code in the previous steps, but I still encountered this problem. Thanks~

///////////////////////////////update///////////////////////////////////
What is the top-level directory of the model you are using: maskgan
Have I written custom code: No
OS Platform: Ubuntu 18.04
TensorFlow installed from: conda
TensorFlow version: 1.13.1
Bazel version: n/a
CUDA/cuDNN: 9
GPU model and memory: ASUS 1070ti 8G
Exact command to reproduce: only readme
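One way to narrow down a "Key ... not found in checkpoint" error like the one above is to compare the variable names the graph expects with the names actually stored in the checkpoint. The following is a minimal sketch, not part of the maskgan code: the helper names are hypothetical, `gen/rnn/weights` is a made-up example name, and the TensorFlow call assumes `tf.train.list_variables` as available in TF 1.x.

```python
def missing_keys(expected_names, checkpoint_names):
    """Return the expected variable names that the checkpoint lacks, sorted."""
    return sorted(set(expected_names) - set(checkpoint_names))


def checkpoint_variable_names(checkpoint_path):
    """List the variable names stored in a checkpoint (requires TensorFlow)."""
    # Imported lazily so that missing_keys() can be used without TensorFlow.
    import tensorflow as tf
    return [name for name, _shape in tf.train.list_variables(checkpoint_path)]


# Illustrative names only; the real lists come from the graph and the checkpoint.
expected = ["critic/rnn/biases", "gen/rnn/weights"]
stored = ["gen/rnn/weights"]
print(missing_keys(expected, stored))
```

Any name reported by `missing_keys` is one the Saver would fail to restore, which helps tell a renamed variable apart from one that was never built in the earlier step.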

tensorflowbutler commented 5 years ago

Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
What is the top-level directory of the model you are using
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

approckw2013 commented 5 years ago

Can you show the problem you are facing, @bbkk5401? I have the same problem.

What is the top-level directory of the model you are using: maskgan
Have I written custom code: No
OS Platform and Distribution: macOS Mojave 10.14
TensorFlow installed from: pip install
TensorFlow version: 1.13.1
Bazel version: n/a
CUDA/cuDNN: n/a
GPU model and memory: Intel Iris Plus Graphics 655 1536 MB/8G
Exact command to reproduce: as in the readme file

bbkk5401 commented 5 years ago

Hi @approckw2013, I am still stuck on this problem and am trying to use init_from_checkpoint to fix it. Thanks~
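For readers unfamiliar with the init_from_checkpoint approach mentioned here, a rough sketch follows. It assumes the TF 1.x calls `tf.train.list_variables` and `tf.train.init_from_checkpoint`; the helper names `build_assignment_map` and `warm_start` are hypothetical, and this is not taken from the maskgan code. The idea is to initialize only the variables that exist in both the checkpoint and the graph, so that a missing key such as critic/rnn/biases is left out of the mapping instead of aborting the restore.

```python
def build_assignment_map(checkpoint_names, graph_names):
    """Map each checkpoint variable to the same-named graph variable.

    Names that exist only in the checkpoint or only in the graph are skipped.
    """
    graph = set(graph_names)
    return {name: name for name in checkpoint_names if name in graph}


def warm_start(checkpoint_path, graph_names):
    """Initialize matching graph variables from a checkpoint (requires TensorFlow 1.x).

    graph_names should be variable names without the ':0' suffix,
    e.g. [v.op.name for v in tf.global_variables()].
    """
    # Imported lazily so that build_assignment_map() works without TensorFlow.
    import tensorflow as tf
    ckpt_names = [name for name, _shape in tf.train.list_variables(checkpoint_path)]
    assignment_map = build_assignment_map(ckpt_names, graph_names)
    tf.train.init_from_checkpoint(checkpoint_path, assignment_map)
    return assignment_map
```

Variables left out of the map keep their ordinary initializers, so this only helps if training those variables from scratch is acceptable.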

a-dai commented 5 years ago

@liamb315 Could you look at this?

nimble00 commented 5 years ago

This is an issue with the seq2seq model because it uses the attention mechanism. It arises when you save the model with an earlier TensorFlow version (seq2seq is old) and restore it with a recent one (saver.restore got updated). The naming convention for LSTM parameters changed, e.g. cell_0/basic_lstm_cell/weights became cell_0/basic_lstm_cell/kernel, which is why old checkpoints cannot be restored with recent TF. Please edit the original maskGAN readme file and add this information. The script below renames the variables, after which everything should work as expected. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/tools/checkpoint_convert.py

I tested it and it worked for me; please confirm whether it works for you.
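To illustrate the kind of renaming described in this comment, here is a small sketch of a name-mapping function. It is not the logic of checkpoint_convert.py itself: the `RENAMES` table and `convert_name` are hypothetical, the weights-to-kernel entry comes from the example above, and the biases-to-bias entry is an assumption about the same naming change.

```python
# Hypothetical suffix map. "/weights" -> "/kernel" is the example from the
# comment above; "/biases" -> "/bias" is an assumed companion rename.
RENAMES = {"/weights": "/kernel", "/biases": "/bias"}


def convert_name(name):
    """Return the new-style name for an old-style variable name."""
    for old_suffix, new_suffix in RENAMES.items():
        if name.endswith(old_suffix):
            return name[: -len(old_suffix)] + new_suffix
    # Names that do not end in a known old suffix are left unchanged.
    return name


print(convert_name("cell_0/basic_lstm_cell/weights"))  # cell_0/basic_lstm_cell/kernel
```

A real conversion also has to rewrite the checkpoint file with the renamed variables, which is what the linked script is for.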

canertol commented 4 years ago

The checkpoint_convert.py suggestion from @nimble00 above doesn't work for me.

EDIT: The error was similar, but the reason was different in my case. In step 2 of the Instructions, attention_option is not used, but step 3 tries to restore attention weights. Adding attention_option in step 2 would be a solution.

EDIT2: For the same reason, you should add --baseline_method=critic to the commands in step 2 if you will use it in the other steps. It seems somebody made a mistake while copy-pasting the last three lines in step 2:

--seq2seq_share_embedding=true \
--baseline_method=critic \
--attention_option=luong
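The mismatch described above can be checked mechanically by comparing the graph-affecting flags of the two commands. The sketch below is only an illustration: the function names are hypothetical, the two command strings are shortened stand-ins rather than the README's actual commands, and the parser assumes simple `--name=value` flags without quoting.

```python
# Flags named in this comment that change which variables the graph contains.
GRAPH_FLAGS = ("seq2seq_share_embedding", "baseline_method", "attention_option")


def parse_flags(command):
    """Collect --name=value flags from a command string into a dict."""
    flags = {}
    for token in command.split():  # lone "\" continuation tokens are ignored
        if token.startswith("--") and "=" in token:
            name, value = token[2:].split("=", 1)
            flags[name] = value
    return flags


def mismatched_flags(step2_command, step3_command, names=GRAPH_FLAGS):
    """Return {flag: (step 2 value, step 3 value)} for flags that differ."""
    step2, step3 = parse_flags(step2_command), parse_flags(step3_command)
    return {n: (step2.get(n), step3.get(n)) for n in names if step2.get(n) != step3.get(n)}


# Shortened, illustrative commands (not copied from the README).
step2 = "python train_mask_gan.py --seq2seq_share_embedding=true"
step3 = ("python train_mask_gan.py --seq2seq_share_embedding=true "
         "--baseline_method=critic --attention_option=luong")
print(mismatched_flags(step2, step3))
```

A flag that shows up with `None` on the step 2 side is one whose variables would be absent from the step 2 checkpoint when step 3 tries to restore them.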

Abhiram4572 commented 4 years ago

I think the checkpoint conversion script has moved to the following location: https://github.com/tensorflow/tensorflow/blob/r1.12/tensorflow/contrib/rnn/python/tools/checkpoint_convert.py