google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0
6.11k stars 753 forks

Multiple graph issue following Colab example #206

Closed: anthonyralston closed this 4 years ago

anthonyralston commented 4 years ago

When copying the code blocks from the t5 Trivia Colab example and changing essentially only the dataset, I run into the following error:

INFO:tensorflow:Using config: {'_model_dir': 'gs://tacred/models/3B', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.70.89.186:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({'worker': ['10.70.89.186:8470']}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.70.89.186:8470', '_evaluation_master': 'grpc://10.70.89.186:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=100, num_shards=None, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu_cluster_resolver.TPUClusterResolver object at 0x7fa956ce8c88>}
INFO:tensorflow:_TPUContext: eval_on_tpu True
INFO:tensorflow:Querying Tensorflow master (grpc://10.70.89.186:8470) for TPU system metadata.
INFO:tensorflow:Initializing TPU system (master: grpc://10.70.89.186:8470) to fetch topology for model parallelism. This might take a while.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 12030765547675663440)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 8573362653273760432)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 14715300196924863413)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 10681032731637905951)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 12094845204974291340)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 18076351581126752532)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 3758747093134136340)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 263551941127461316)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 16302722464130166815)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 2824549390527005376)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 14578457931969383467)
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-866dcfefe973> in <module>()
      5     pretrained_model_dir=PRETRAINED_DIR,
----> 6     finetune_steps=FINETUNE_STEPS
      7 )

31 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py in _assert_same_graph(original_item, item)
   5816   if original_item.graph is not item.graph:
   5817     raise ValueError("%s must be from the same graph as %s." %
-> 5818                      (item, original_item))
   5819 
   5820 

ValueError: Tensor("count:0", shape=(), dtype=int64, device=/job:worker/task:0/device:CPU:0) must be from the same graph as Tensor("MapDataset:0", shape=(), dtype=variant).
  In call to configurable 'mesh_train_dataset_fn' (<function mesh_train_dataset_fn at 0x7fa9e7686d90>) 

Any idea what might be causing this?

I should add that this was running in Colab.

sharannarang commented 4 years ago

@adarob , any ideas what could be causing this?

adarob commented 4 years ago

I'd need to see what changes you made to be able to help debug.

anthonyralston commented 4 years ago

Would I be able to send you the Colab link by email, @adarob? If you could take a quick look I would really appreciate it, as I'm a bit stuck.

adarob commented 4 years ago

Sure. adarob@google.com

anthonyralston commented 4 years ago

Thanks, email sent!

adarob commented 4 years ago

I just had a quick look, and the issue is that your dataset_fn is accessing global variables. Your dataset_fn needs to be self-contained, since it is actually run on the TPU host, not in the Colab runtime.
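Concretely, everything the function needs should be defined inside it or passed in as arguments, never captured from another Colab cell. A minimal sketch of what that looks like, assuming a TSV dataset with one input/target pair per line (the bucket path, file layout, and function name here are illustrative placeholders, not the poster's actual setup):

```python
import tensorflow as tf

def my_dataset_fn(split, shuffle_files=False,
                  data_dir="gs://my-bucket/my-dataset"):
  """Builds the dataset from values defined inside the function.

  Nothing here closes over tensors or tf.data objects created in other
  Colab cells; capturing those is what produces the "must be from the
  same graph" ValueError, because this function is re-run inside the
  TPU host's own TF graph.
  """
  del shuffle_files  # Unused in this sketch.
  path = "%s/%s.tsv" % (data_dir, split)
  ds = tf.data.TextLineDataset(path)

  def parse(line):
    # Each TSV line is "<input>\t<target>".
    fields = tf.io.decode_csv(
        line, record_defaults=["", ""], field_delim="\t",
        use_quote_delim=False)
    return {"inputs": fields[0], "targets": fields[1]}

  return ds.map(parse,
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
```

The key point is not the parsing details but the scoping: plain Python constants (strings, ints) at the top of the cell are fine to reference, because they are re-evaluated when the function runs; TF tensors, variables, or datasets built in another cell are not, because they belong to a different graph.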

anthonyralston commented 4 years ago

Ah fantastic, thanks for taking a look.