Can't load t2t model - Githubissues

ghost commented 5 years ago

I ran this command:

python ../sgnmt/decode.py --ignore_sanity_checks True --config_file ini/mymodel.ini --src_test ../../data/test.txt.ja.fixed --range 1:2

and mymodel.ini is this:

predictors: t2t

t2t_model: transformer
t2t_checkpoint_dir: own_model/model1
t2t_problem: translate_jaen
t2t_hparams_set: transformer_base_single_gpu
t2t_usr_dir: /root/work/aspec_experiments/tensor2tensor/model
pred_src_vocab_size: 32768
pred_tgt_vocab_size: 32768
pred_src_vocab: own_model/model1/vocab.translate_jaen.32768.subwords
pred_tgt_vocab: own_model/model1/vocab.translate_jaen.32768.subwords

but it doesn't load my model.

Could you please tell me how to fix this?

2019-09-26 02:09:44,703 CRITICAL: Could not find all variables of the computation graph in the T2T checkpoint file. This means that the checkpoint does not correspond to the model specified in SGNMT. Please double-check pred_src_vocab_size, pred_trg_vocab_size, and all the t2t_* parameters. Also make sure that the checkpoint exists and is readable
2019-09-26 02:09:44,704 CRITICAL: Invalid argument for one of the predictors: Could not initialize TF session..Stack trace: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: Key transformer/symbol_modality_30000_512/softmax/weights_0 not found in checkpoint
         [[{{node save/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1286, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key transformer/symbol_modality_30000_512/softmax/weights_0 not found in checkpoint
         [[node save/RestoreV2 (defined at /root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/tf_utils.py:95) ]]

Original stack trace for 'save/RestoreV2':
  File "../sgnmt/decode.py", line 2, in <module>
    import cam.sgnmt.decode
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/decode.py", line 105, in <module>
    decoder = decode_utils.create_decoder()
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/decode_utils.py", line 657, in create_decoder
    add_predictors(decoder)
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/decode_utils.py", line 283, in add_predictors
    pop_id=args.syntax_pop_id)
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/predictors/tf_t2t.py", line 329, in __init__
    self.mon_sess = self.create_session()
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/predictors/tf_t2t.py", line 243, in create_session
    self._n_cpu_threads)
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/tf_utils.py", line 95, in create_session
    config=session_config(n_cpu_threads)))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1007, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1200, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1205, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 871, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 638, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 229, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 599, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 502, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 381, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1296, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1614, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 678, in get_tensor
    return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/tf_utils.py", line 95, in create_session
    config=session_config(n_cpu_threads)))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1007, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1200, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1205, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 871, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py", line 290, in prepare_session
    config=config)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py", line 204, in _restore_checkpoint
    saver.restore(sess, checkpoint_filename_with_path)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1302, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key transformer/symbol_modality_30000_512/softmax/weights_0 not found in checkpoint
         [[node save/RestoreV2 (defined at /root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/tf_utils.py:95) ]]

Original stack trace for 'save/RestoreV2':
  File "../sgnmt/decode.py", line 2, in <module>
    import cam.sgnmt.decode
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/decode.py", line 105, in <module>
    decoder = decode_utils.create_decoder()
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/decode_utils.py", line 657, in create_decoder
    add_predictors(decoder)
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/decode_utils.py", line 283, in add_predictors
    pop_id=args.syntax_pop_id)
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/predictors/tf_t2t.py", line 329, in __init__
    self.mon_sess = self.create_session()
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/predictors/tf_t2t.py", line 243, in create_session
    self._n_cpu_threads)
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/tf_utils.py", line 95, in create_session
    config=session_config(n_cpu_threads)))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1007, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1200, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1205, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 871, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 638, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 229, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 599, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 825, in __init__
    self.build()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 837, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 875, in _build
    build_restore=build_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 502, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 381, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/decode_utils.py", line 283, in add_predictors
    pop_id=args.syntax_pop_id)
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/predictors/tf_t2t.py", line 329, in __init__
    self.mon_sess = self.create_session()
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/predictors/tf_t2t.py", line 243, in create_session
    self._n_cpu_threads)
  File "/root/work/aspec_experiments/sgnmt/sgnmt/cam/sgnmt/tf_utils.py", line 103, in create_session
    raise AttributeError("Could not initialize TF session.")
AttributeError: Could not initialize TF session.

2019-09-26 02:09:44,725 CRITICAL: Terminated due to an error in the predictor configuration.

fstahlberg commented 5 years ago

As indicated in the error message, have you double-checked pred_src_vocab_size, pred_trg_vocab_size, and all the t2t_* parameters? In particular, the following line:

NotFoundError: Key transformer/symbol_modality_30000_512/softmax/weights_0 not found in checkpoint

seems to suggest that the model you are trying to load was trained with a vocabulary size of 30000. Try setting pred_src_vocab_size and pred_trg_vocab_size to 30000.
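The variable name in that error carries two numbers. As a minimal sketch, assuming the scope follows the pattern symbol_modality_<vocab size>_<hidden size> (the helper name modality_sizes is hypothetical and not part of SGNMT or T2T), they can be pulled out like this:

```python
import re

# Matches scopes such as "symbol_modality_30000_512".
# Assumption: first number is the vocabulary size, second the hidden size.
MODALITY_RE = re.compile(r"symbol_modality_(\d+)_(\d+)")


def modality_sizes(variable_name):
    """Return (vocab_size, hidden_size) found in a variable name, or None."""
    match = MODALITY_RE.search(variable_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


key = "transformer/symbol_modality_30000_512/softmax/weights_0"
print(modality_sizes(key))  # (30000, 512)
```

Comparing the first number with the vocabulary size set in the config file is a quick way to spot a mismatch.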

ghost commented 5 years ago

I changed pred_src_vocab_size and pred_trg_vocab_size to 30000, and also fixed the typo 'tgt' to 'trg', but the same errors occurred.
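A misspelled option name like the 'tgt'/'trg' mix-up can be caught with a small check before decoding. This is only a sketch: KNOWN_KEYS holds just the option names that appear in this thread, not the full set of SGNMT options, and suspicious_keys is a hypothetical helper.

```python
import difflib

# Option names taken from this thread only; NOT a complete list of SGNMT options.
KNOWN_KEYS = {
    "predictors",
    "t2t_model",
    "t2t_checkpoint_dir",
    "t2t_problem",
    "t2t_hparams_set",
    "t2t_usr_dir",
    "pred_src_vocab_size",
    "pred_trg_vocab_size",
}


def suspicious_keys(ini_text, known=KNOWN_KEYS):
    """Map each unrecognised key to its closest known spelling (or None)."""
    report = {}
    candidates = sorted(known)  # sorted for deterministic suggestions
    for line in ini_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key = line.split(":", 1)[0].strip()
        if key not in known:
            close = difflib.get_close_matches(key, candidates, n=1)
            report[key] = close[0] if close else None
    return report


sample = """predictors: t2t
pred_src_vocab_size: 32768
pred_tgt_vocab_size: 32768
"""
print(suspicious_keys(sample))
# {'pred_tgt_vocab_size': 'pred_trg_vocab_size'}
```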

ghost commented 5 years ago

I checked my vocab size:

$ wc -l own_model/model1/vocab.translate_jaen.32768.subwords
36632 own_model/model1/vocab.translate_jaen.32768.subwords

and changed vocab size to 36638, but another error happened.

I assumed that it requires the integer IDs of subwords as input, so I encoded the sentences to subword IDs. Finally, I got this:

2019-09-26 23:53:14,645 INFO: Next sentence (ID: 1): 29244 29593 29983 29948 11978 13215 18019 24558 26 85 1521 21484 22202 26 89 48 1395 1921 27611 8348 63 5107 2325 31
2019-09-26 23:53:15.547951: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-26 23:53:17,282 INFO: Decoded (ID: 1): 7790 80 6397 11115 26 5218 28 29 8590 71 2293 32 122 4732 4144 557 5509 225 91 72 22672 74 38 17183 27
2019-09-26 23:53:17,282 INFO: Stats (ID: 1): score=-7.478206 num_expansions=100 time=2.64

It seems to work. Thank you so much!

fstahlberg commented 5 years ago

Yes, SGNMT expects integer IDs by default, but it also supports text format - see the wmap, preprocessing, and postprocessing options.
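The integer-ID format can be handled with two small helpers. This is a sketch, assuming one sentence per line with space-separated IDs, as the log lines in this thread suggest; ids_to_line and line_to_ids are hypothetical names. Mapping between text and IDs needs the subword vocabulary the model was trained with and is not shown here.

```python
def ids_to_line(ids):
    """Format a sequence of integer IDs as one space-separated line."""
    return " ".join(str(i) for i in ids)


def line_to_ids(line):
    """Parse a space-separated line of integer IDs back into a list."""
    return [int(tok) for tok in line.split()]


# First few IDs of the decoded output shown in the log above.
decoded = "7790 80 6397 11115 26"
ids = line_to_ids(decoded)
print(ids)               # [7790, 80, 6397, 11115, 26]
print(ids_to_line(ids))  # 7790 80 6397 11115 26
```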

ucam-smt / sgnmt

Can't load t2t model #6