tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

T2T 1.15.7 version with Tensorflow 2.2 - t2t-decoder doesn't run #1849

Open assij opened 4 years ago

assij commented 4 years ago

Description

When I run the t2t-decoder script (En-De transformer-big) on a model that was trained on 8 GPUs using DistributedMirrorStrategy, I get the following error:

ValueError: Tensor("body/parallel_0/body/decoder/layer_0/self_attention/multihead_attention/dot_product_attention/attention:0", shape=(), dtype=string, device=/device:GPU:0) must be from the same graph as Tensor("transformer_hparams:0", shape=(), dtype=string). ...

Environment information

OS: Ubuntu 18.04.4 LTS

$ pip freeze | grep tensor

tensorboard==2.2.2
tensorboard-plugin-wit==1.7.0
tensorflow-addons==0.11.2
tensorflow-datasets==2.1.0
tensorflow-estimator==2.2.0
tensorflow-gan==2.0.0
tensorflow-gpu==2.2.0
tensorflow-hub==0.9.0
tensorflow-metadata==0.23.0
tensorflow-probability==0.7.0

$ python -V
Python 3.6.10 :: Anaconda, Inc.

For bugs: reproduction and error logs

# Steps to reproduce:
Run t2t-decoder from an input file

# Error logs:
INFO:tensorflow:Done calling model_fn.
I0913 15:54:39.852982 140520229324608 estimator.py:1171] Done calling model_fn.
Traceback (most recent call last):
  File "t2t-decoder", line 23, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "t2t-decoder", line 15, in main
    t2t_decoder.main(argv)
  File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/bin/t2t_decoder.py", line 210, in main
    decode(estimator, hp, decode_hp)
  File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/bin/t2t_decoder.py", line 99, in decode
    checkpoint_path=FLAGS.checkpoint_path)
  File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/utils/decoding.py", line 481, in decode_from_file
    for elapsed_time, result in timer(result_iter):
  File "/workdisk/tf2_2/tensor2tensor/tensor2tensor/utils/decoding.py", line 473, in timer
    item = next(gen)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 629, in predict
    hooks=all_hooks) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1038, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 749, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1231, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1236, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 902, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 660, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 232, in finalize
    summary.merge_all)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 297, in get_or_default
    op = default_constructor()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary.py", line 406, in merge_all
    return merge(summary_ops, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary.py", line 370, in merge
    with _ops.name_scope(name, 'Merge', inputs):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 6284, in __enter__
    g_from_inputs = _get_graph_from_inputs(self._values)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 5921, in _get_graph_from_inputs
    _assert_same_graph(original_graph_element, graph_element)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 5856, in _assert_same_graph
    (item, original_item))
ValueError: Tensor("body/parallel_0/body/decoder/layer_0/self_attention/multihead_attention/dot_product_attention/attention:0", shape=(), dtype=string, device=/device:GPU:0) must be from the same graph as Tensor("transformer_hparams:0", shape=(), dtype=string).
neverdoubt commented 4 years ago

Same here. I simply changed

import tensorflow as tf

to

import tensorflow.compat.v1 as tf

wjm41 commented 4 years ago

> Same here. I simply changed
>
> import tensorflow as tf
>
> to
>
> import tensorflow.compat.v1 as tf

I tried this but it did not fix it (same issue as OP).

baojianzhou commented 4 years ago

@wjm41 I was wondering whether you fixed it or not. I have exactly the same issue here.

wjm41 commented 4 years ago

Haven't been able to fix it yet - looks like it's something to do with the save/loading of the model but I'm not experienced enough with TF to know where to look :(

baojianzhou commented 4 years ago

@wjm41 Thanks for replying. I fixed mine by adding the following:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

wjm41 commented 4 years ago

@baojianzhou I tried adding that to both t2t-decoder and t2t-trainer which gives me a new error:

tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key transformer/body/parallel_0/body/encoder/layer_0/ffn/conv1/bias not found in checkpoint
     [[node save/RestoreV2_1 (defined at /lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py:630) ]]
baojianzhou commented 4 years ago

@wjm41, I believe I got the same error too, if I recall correctly. Have you retrained your model yet?

My understanding is that if you load a checkpoint from a model that was trained without tf.disable_v2_behavior(), TensorFlow somehow still uses some V2 features. My solution was to retrain the model from the beginning with tf.disable_v2_behavior() added. The decoder then finished successfully with the newly trained checkpoint. Hope it helps.
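
One way to investigate a "Key ... not found in checkpoint" error is to compare the key named in the error with the variable names actually stored in the checkpoint. The sketch below is only an illustration, not something posted in this thread: `load_checkpoint_keys` and `closest_checkpoint_keys` are hypothetical helper names, the stored key list in the usage section is made up, and `tf.train.list_variables` is assumed to be available in the installed TensorFlow.

```python
import difflib


def load_checkpoint_keys(checkpoint_path):
    """Return the variable names stored in a checkpoint (needs TensorFlow)."""
    # Imported lazily so the pure-Python helper below works without TensorFlow.
    import tensorflow as tf

    # list_variables yields (name, shape) pairs; only the names are kept.
    return [name for name, _shape in tf.train.list_variables(checkpoint_path)]


def closest_checkpoint_keys(missing_key, checkpoint_keys, n=3):
    """Return up to n stored keys that most resemble the key the graph asked for."""
    # cutoff=0.0 keeps every candidate, so the best matches are always returned.
    return difflib.get_close_matches(missing_key, checkpoint_keys, n=n, cutoff=0.0)


if __name__ == "__main__":
    # Key taken from the error message above.
    missing = "transformer/body/parallel_0/body/encoder/layer_0/ffn/conv1/bias"
    # Hypothetical stored keys, for illustration only. With a real model,
    # use load_checkpoint_keys("/path/to/model_dir") instead.
    stored = [
        "transformer/body/encoder/layer_0/ffn/conv1/bias",
        "transformer/body/encoder/layer_0/ffn/conv2/bias",
        "global_step",
    ]
    for key in closest_checkpoint_keys(missing, stored):
        print(key)
```

If the closest stored key differs from the requested one only by a name-scope prefix, that would suggest the training and decoding graphs were built differently, which is consistent with retraining under the same settings making the error go away.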

wjm41 commented 4 years ago

@baojianzhou Yes it's working now! Thanks so much :)

assij commented 4 years ago

@baojianzhou I trained the model again with tf.disable_v2_behavior() added to t2t-trainer, but t2t-decoder still has issues. Can you please attach the files that you are using, including the train command line and the decoder command line?

wjm41 commented 4 years ago

@assij my t2t-trainer looks like this:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensor2tensor.bin import t2t_trainer

import tensorflow.compat.v1 as tf

def main(argv):
  t2t_trainer.main(argv)

if __name__ == "__main__":
  tf.disable_v2_behavior()
  tf.logging.set_verbosity(tf.logging.INFO)
  tf.app.run(main)

and my t2t-decoder looks like this:

"""t2t-decoder."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensor2tensor.bin import t2t_decoder

import tensorflow.compat.v1 as tf

def main(argv):
  t2t_decoder.main(argv)

if __name__ == "__main__":
  # Switch TensorFlow to 1.x behavior before the decoder builds its graph.
  tf.disable_v2_behavior()
  tf.logging.set_verbosity(tf.logging.INFO)
  tf.app.run()

assij commented 4 years ago

@wjm41 Thanks, are you using the t2t-trainer with --optionally_use_dist_strat=True ?

wjm41 commented 4 years ago

@assij No I wasn't - I got it working for a transformer on a custom PROBLEM, not sure that changing hparams should affect this problem in particular.

assij commented 4 years ago

@wjm41 Are you using t2t tag 1.15.7 as is, with only the above 2 changes? Are you training on multiple GPUs or 1 GPU? I'm working on multiple GPUs. Can you please send the result of pip freeze | grep tensor?

vikingmars commented 4 years ago

> @wjm41 Are you using t2t tag 1.15.7 as is, with only the above 2 changes? Are you training on multiple GPUs or 1 GPU? I'm working on multiple GPUs. Can you please send the result of pip freeze | grep tensor?

I've got exactly the same issue. I've tried the solution mentioned above, but it's still not working. Have you fixed it?

Nanamumuhan commented 4 years ago

2020-11-11 05:25:43.286162: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key evolved_transformer/body/decoder/layer_0/first_attend_to_encoder/multihead_attention/k/kernel not found in checkpoint
2020-11-11 05:25:43.286801: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key evolved_transformer/body/parallel_0/body/encoder/layer_0/conv_branches/dense_2/bias not found in checkpoint
Traceback (most recent call last):
  File "/home/WwhStuGrp/WwhStu11G/anaconda3/envs/py3.7-tensorflow/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/WwhStuGrp/WwhStu11G/anaconda3/envs/py3.7-tensorflow/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/WwhStuGrp/WwhStu11G/anaconda3/envs/py3.7-tensorflow/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Key evolved_transformer/body/decoder/layer_0/first_attend_to_encoder/multihead_attention/k/kernel not found in checkpoint
     [[{{node save/RestoreV2}}]]
  (1) Not found: Key evolved_transformer/body/decoder/layer_0/first_attend_to_encoder/multihead_attention/k/kernel not found in checkpoint
     [[{{node save/RestoreV2}}]]
     [[save/RestoreV2_1/_13]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

DawsenWSH commented 3 years ago

> @wjm41, I believe I got the same error too, if I recall correctly. Have you retrained your model yet?
>
> My understanding is that if you load a checkpoint from a model that was trained without tf.disable_v2_behavior(), TensorFlow somehow still uses some V2 features. My solution was to retrain the model from the beginning with tf.disable_v2_behavior() added. The decoder then finished successfully with the newly trained checkpoint. Hope it helps.

I retrained the model and added tf.disable_v2_behavior() to t2t-trainer, t2t-decoder, and t2t-translate-all, but I still have the problem:

root error(s) found.
  (0) Not found: Key transformer/body/decoder/layer_0/encdec_attention/multihead_attention/k/kernel not found in checkpoint
     [[node save/RestoreV2 (defined at /lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:629) ]]
  (1) Not found: Key transformer/body/decoder/layer_0/encdec_attention/multihead_attention/k/kernel not found in checkpoint
     [[node save/RestoreV2 (defined at /lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:629) ]]
     [[save/RestoreV2_1/_249]]

Do you know the reason?

hashk1 commented 3 years ago

You should install tensor2tensor from GitHub, as follows:

git clone https://github.com/tensorflow/tensor2tensor.git
cd tensor2tensor
pip install .

Then replace t2t-trainer and t2t-decoder with the versions from https://github.com/tensorflow/tensor2tensor/issues/1849#issuecomment-701491229

DawsenWSH commented 3 years ago

Yes, it's working now! Thanks very much!