google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

"Not found: Key decoder/block_000/layer_000/SelfAttention/relative_attention_bias not found in checkpoint" #36

Closed: danyaljj closed this issue 4 years ago

danyaljj commented 4 years ago

When running the following command for fine-tuning:

t5_mesh_transformer  \
  --t5_tfds_data_dir="gs://danielk-files" \
  --gin_file="dataset.gin" \
  --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
  --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
  --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" \
  --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"

I am getting the following error:

Not found: Key decoder/block_000/layer_000/SelfAttention/relative_attention_bias not found in checkpoint

Here is the full error log:

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
INFO:tensorflow:Graph was finalized.
I0107 10:43:38.246214 140625456166720 monitored_session.py:240] Graph was finalized.
2020-01-07 10:43:38.246462: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-01-07 10:43:38.277149: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2020-01-07 10:43:38.279231: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x91ab450 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-01-07 10:43:38.279274: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-01-07 10:43:38.755452: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x914b5f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-01-07 10:43:38.755483: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Quadro RTX 8000, Compute Capability 7.5
2020-01-07 10:43:38.755491: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Quadro RTX 8000, Compute Capability 7.5
2020-01-07 10:43:38.755498: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): Quadro RTX 8000, Compute Capability 7.5
2020-01-07 10:43:38.772836: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:17:00.0
2020-01-07 10:43:38.774193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:65:00.0
2020-01-07 10:43:38.775493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties:
name: Quadro RTX 8000 major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:b3:00.0
2020-01-07 10:43:38.775601: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2020-01-07 10:43:38.775641: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2020-01-07 10:43:38.775673: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2020-01-07 10:43:38.775704: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2020-01-07 10:43:38.775735: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2020-01-07 10:43:38.775765: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2020-01-07 10:43:38.775796: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2020-01-07 10:43:38.775803: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-01-07 10:43:38.775937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-07 10:43:38.775945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 1 2
2020-01-07 10:43:38.775949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N Y Y
2020-01-07 10:43:38.775953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1:   Y N Y
2020-01-07 10:43:38.775957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 2:   Y Y N
INFO:tensorflow:Restoring parameters from /tmp/transformer_standalone/model.ckpt-0
I0107 10:43:38.778871 140625456166720 saver.py:1284] Restoring parameters from /tmp/transformer_standalone/model.ckpt-0
2020-01-07 10:43:44.485832: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key decoder/block_000/layer_000/SelfAttention/relative_attention_bias not found in checkpoint
ERROR:tensorflow:Error recorded from training_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key decoder/block_000/layer_000/SelfAttention/relative_attention_bias not found in checkpoint
         [[node save/RestoreV2_1 (defined at /lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'save/RestoreV2_1':
  File "/bin/t5_mesh_transformer", line 8, in <module>
    sys.exit(console_entry_point())
  File "/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 218, in console_entry_point
    app.run(main)
  File "/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 212, in main
    model_dir=FLAGS.model_dir)
  File "/lib/python3.6/site-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1701, in run
    train_dataset_fn, train_steps, ensemble_inputs)
  File "/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1092, in train_model
    estimator.train(input_fn=input_fn, max_steps=train_steps)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3126, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1663, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1994, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 599, in my_model_fn
    save_relative_paths=True)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
    self.build()
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 840, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 878, in _build
    build_restore=build_restore)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 502, in _build_internal
    restore_sequentially, reshape)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 381, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

E0107 10:43:44.498865 140625456166720 error_handling.py:75] Error recorded from training_loop: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key decoder/block_000/layer_000/SelfAttention/relative_attention_bias not found in checkpoint
         [[node save/RestoreV2_1 (defined at /lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'save/RestoreV2_1':
  File "/bin/t5_mesh_transformer", line 8, in <module>
    sys.exit(console_entry_point())
  File "/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 218, in console_entry_point
    app.run(main)
  File "/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 212, in main
    model_dir=FLAGS.model_dir)
  File "/lib/python3.6/site-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1701, in run
    train_dataset_fn, train_steps, ensemble_inputs)
  File "/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1092, in train_model
    estimator.train(input_fn=input_fn, max_steps=train_steps)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3126, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1663, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1994, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 599, in my_model_fn
    save_relative_paths=True)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
    self.build()
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 840, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 878, in _build
    build_restore=build_restore)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 502, in _build_internal
    restore_sequentially, reshape)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 381, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

INFO:tensorflow:training_loop marked as finished
I0107 10:43:44.500117 140625456166720 error_handling.py:101] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0107 10:43:44.500193 140625456166720 error_handling.py:135] Reraising captured error
Traceback (most recent call last):
  File "/home/danielk/text-to-text-transfer-transformer/env36/bin/t5_mesh_transformer", line 8, in <module>
    sys.exit(console_entry_point())
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 218, in console_entry_point
    app.run(main)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 212, in main
    model_dir=FLAGS.model_dir)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/gin/config.py", line 1078, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1701, in run
    train_dataset_fn, train_steps, ensemble_inputs)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1092, in train_model
    estimator.train(input_fn=input_fn, max_steps=train_steps)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/six.py", line 696, in reraise
    raise value
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 725, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1207, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1212, in _create_session
    return self._sess_creator.create_session()
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 878, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 647, in create_session
    init_fn=self._scaffold.init_fn)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/session_manager.py", line 290, in prepare_session
    config=config)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/session_manager.py", line 220, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/home/danielk/text-to-text-transfer-transformer/env36/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1306, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Key decoder/block_000/layer_000/SelfAttention/relative_attention_bias not found in checkpoint
         [[node save/RestoreV2_1 (defined at /lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'save/RestoreV2_1':
  File "/bin/t5_mesh_transformer", line 8, in <module>
    sys.exit(console_entry_point())
  File "/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 218, in console_entry_point
    app.run(main)
  File "/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 212, in main
    model_dir=FLAGS.model_dir)
  File "/lib/python3.6/site-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1701, in run
    train_dataset_fn, train_steps, ensemble_inputs)
  File "/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1092, in train_model
    estimator.train(input_fn=input_fn, max_steps=train_steps)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1191, in _train_model_default
    features, labels, ModeKeys.TRAIN, self.config)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
    config)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3126, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1663, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1994, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 599, in my_model_fn
    save_relative_paths=True)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
    self.build()
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 840, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 878, in _build
    build_restore=build_restore)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 502, in _build_internal
    restore_sequentially, reshape)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 381, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

For some reason, when I drop the last line (--gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"), it works fine. This is surprising, since I was under the impression that this line determines which pre-trained model to use (small, base, large, etc.).

Additional info: I am running this on a GPU machine, but that shouldn't be a problem, since the error happens while loading the model (before any computation).

craffel commented 4 years ago

Hi, you ought to pass in a

  --model_dir="${MODEL_DIR}" \

and make sure that any checkpoint already in that directory does not disagree with the pretrained model (small).
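
For illustration, here is a minimal sketch of the command from the original report with an explicit --model_dir added. The wrapper name run_finetune and the example path are hypothetical; every flag is copied from the command above or from the --model_dir suggestion, and nothing else is assumed about the CLI.

```shell
# Sketch only: wraps the command from the report and adds --model_dir.
# run_finetune and its single argument are hypothetical names for this example.
run_finetune() {
  model_dir="$1"  # a directory you have write access to
  t5_mesh_transformer \
    --model_dir="${model_dir}" \
    --t5_tfds_data_dir="gs://danielk-files" \
    --gin_file="dataset.gin" \
    --gin_param="utils.tpu_mesh_shape.model_parallelism = 1" \
    --gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
    --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" \
    --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"
}

# Example call (the path is a placeholder, not a required location):
# run_finetune "${HOME}/t5_small_mrpc"
```

Using a fresh directory per model size should keep a run from picking up a checkpoint left behind by a different configuration.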

craffel commented 4 years ago

(it looks like it is defaulting to

/tmp/transformer_standalone/model.ckpt-0

which I am guessing was created by a previous run that did not use the "small" model, maybe?)
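
One way to check for such a leftover checkpoint before launching is sketched below. The helper has_checkpoint is a hypothetical name; it only looks for the files TensorFlow normally writes into a model directory (a `checkpoint` index file and `model.ckpt-*` files) and assumes nothing about their contents.

```shell
# Hypothetical helper: succeeds if the directory already holds checkpoint files.
has_checkpoint() {
  dir="$1"
  # TensorFlow keeps a small "checkpoint" index file next to the model.ckpt-* files.
  [ -e "${dir}/checkpoint" ] && return 0
  for f in "${dir}"/model.ckpt-*; do
    # If the glob matched nothing, $f is the literal pattern and this test fails.
    [ -e "$f" ] && return 0
  done
  return 1
}

# /tmp/transformer_standalone is the directory the log above restored from.
if has_checkpoint /tmp/transformer_standalone; then
  echo "Existing checkpoint found in /tmp/transformer_standalone; a new run with this model dir would try to restore it."
fi
```

If the check reports a checkpoint, pointing --model_dir at a different, empty directory avoids the mismatch without deleting anything.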

danyaljj commented 4 years ago

I see. Just to make sure, is the following a valid syntax for referencing the model-dir?

 --model_dir="gs://t5-data/pretrained_models/small/"

Also, what's the intent behind the following line?

  --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"

craffel commented 4 years ago

> I see. Just to make sure, is the following a valid syntax for referencing the model-dir?

Not quite; you want to choose a model dir that you have write access to.

> Also, what's the intent behind the following line?

That loads all of the configuration for the small model (including the pre-trained checkpoint location).
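
To see what that file actually sets, one option is to copy it locally and search it. This is a rough sketch: show_checkpoint_settings is a hypothetical helper, the local path is a placeholder, and it assumes gsutil is installed and can read the bucket. It does not assume any particular parameter name, it just prints the lines that mention a checkpoint.

```shell
# Hypothetical helper: print the numbered lines of a gin config that mention checkpoints.
show_checkpoint_settings() {
  grep -n -i "checkpoint" "$1"
}

# Assumes gsutil is installed and the bucket is readable from your account:
# gsutil cp gs://t5-data/pretrained_models/small/operative_config.gin /tmp/small_operative_config.gin
# show_checkpoint_settings /tmp/small_operative_config.gin
```

Comparing the printed location with the contents of your --model_dir is a quick way to confirm which checkpoint a run is expected to start from.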