tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

Load checkpoint for transformer_parallel model #1415

Open goyalrasna opened 5 years ago

goyalrasna commented 5 years ago

Description

I am trying to follow the approach described in the paper "Blockwise Parallel Decoding for Deep Autoregressive Models". It states that a model is first trained as a standard transformer on a task using the hparams set transformer_base, and that transformer_block_parallel is then trained on top of it. I am unable to load the checkpoint created by the transformer training run in order to train with transformer_block_parallel.

...

Environment information

OS: <your answer here>

$ pip freeze | grep tensor
# your output here

$ python -V
# your output here

For bugs: reproduction and error logs

# Steps to reproduce:
...
# Error logs:
2019-01-28 19:00:29.216679: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key transformer_block_parallel/body/block_size_2/conv1/bias not found in checkpoint
Traceback (most recent call last):
  File "/home/rasna_goyal66/.local/bin/t2t-trainer", line 33, in <module>
    tf.app.run()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/rasna_goyal66/.local/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensor2tensor/bin/t2t_trainer.py", line 393, in main
    execute_schedule(exp)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensor2tensor/bin/t2t_trainer.py", line 349, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensor2tensor/utils/trainer_lib.py", line 439, in continuous_train_and_eval
    return self.evaluate()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensor2tensor/utils/trainer_lib.py", line 514, in evaluate
    name=name)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 478, in evaluate
    return _evaluate()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 467, in _evaluate
    output_dir=self.eval_dir(name))
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1591, in _evaluate_run
    config=self._session_config)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/evaluation.py", line 271, in _evaluate_once
    session_creator=session_creator, hooks=hooks) as session:
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
    init_fn=self._scaffold.init_fn)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 288, in prepare_session
    config=config)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 202, in _restore_checkpoint
    saver.restore(sess, checkpoint_filename_with_path)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1562, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key transformer_block_parallel/body/block_size_2/conv1/bias not found in checkpoint
     [[node save/RestoreV2_1 (defined at /home/rasna_goyal66/.local/lib/python2.7/site-packages/tensor2tensor/utils/trainer_lib.py:514)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]

Caused by op u'save/RestoreV2_1', defined at:
  File "/home/rasna_goyal66/.local/bin/t2t-trainer", line 33, in <module>
    tf.app.run()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/rasna_goyal66/.local/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensor2tensor/bin/t2t_trainer.py", line 393, in main
    execute_schedule(exp)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensor2tensor/bin/t2t_trainer.py", line 349, in execute_schedule
    getattr(exp, FLAGS.schedule)()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensor2tensor/utils/trainer_lib.py", line 439, in continuous_train_and_eval
    return self.evaluate()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensor2tensor/utils/trainer_lib.py", line 514, in evaluate
    name=name)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 478, in evaluate
    return _evaluate()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 467, in _evaluate
    output_dir=self.eval_dir(name))
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1591, in _evaluate_run
    config=self._session_config)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/evaluation.py", line 271, in _evaluate_once
    session_creator=session_creator, hooks=hooks) as session:
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
    self._scaffold.finalize()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 213, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 886, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1102, in __init__
    self.build()
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 789, in _build_internal
    restore_sequentially, reshape)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 459, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
    restore_sequentially)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/rasna_goyal66/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key transformer_block_parallel/body/block_size_2/conv1/bias not found in checkpoint
     [[node save/RestoreV2_1 (defined at /home/rasna_goyal66/.local/lib/python2.7/site-packages/tensor2tensor/utils/trainer_lib.py:514)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]
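
The error above says the restore op expects a variable under the transformer_block_parallel scope that the checkpoint does not contain. One way to reason about such a mismatch is to compare the names stored in the checkpoint with the names the new model expects. The following is a minimal sketch in plain Python, not the tensor2tensor mechanism: it assumes the base checkpoint keeps its variables under a top-level `transformer/` scope (not confirmed in this thread), the helper `plan_warm_start` is hypothetical, and every variable name except the `conv1/bias` key from the log is made up for illustration.

```python
def plan_warm_start(checkpoint_keys, model_keys,
                    old_scope="transformer",
                    new_scope="transformer_block_parallel"):
    """Split model variables into restorable ones and ones absent from the checkpoint.

    Returns (restorable, missing) where restorable maps each model variable
    name to the checkpoint key it could be loaded from.
    """
    available = set(checkpoint_keys)
    restorable = {}  # model variable name -> checkpoint key
    missing = []     # model variables with no counterpart in the checkpoint
    prefix = new_scope + "/"
    for name in model_keys:
        # Exact match: the checkpoint already uses the same name.
        if name in available:
            restorable[name] = name
            continue
        # Otherwise try the same path under the old top-level scope (assumed layout).
        if name.startswith(prefix):
            candidate = old_scope + "/" + name[len(prefix):]
            if candidate in available:
                restorable[name] = candidate
                continue
        missing.append(name)
    return restorable, missing


# Illustrative names only; the last model key is the one reported in the error log.
checkpoint_keys = [
    "transformer/body/layer_a/kernel",
    "transformer/body/layer_a/bias",
]
model_keys = [
    "transformer_block_parallel/body/layer_a/kernel",
    "transformer_block_parallel/body/layer_a/bias",
    "transformer_block_parallel/body/block_size_2/conv1/bias",
]

restorable, missing = plan_warm_start(checkpoint_keys, model_keys)
print("restorable:", restorable)
print("must be freshly initialized:", missing)
```

In a TensorFlow 1.x environment the checkpoint side of this comparison could be filled in from `tf.train.list_variables`, which returns (name, shape) pairs for a checkpoint. Variables that end up in `missing` have no saved values and would need fresh initialization rather than a plain restore.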

vivian-stars commented 5 years ago

Same problem. Does anyone have a solution?