google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0
6.2k stars, 757 forks

fine-tuning stuck running from command-line, but runs in ipython #558

Open jayendra13 opened 3 years ago

jayendra13 commented 3 years ago

I have created a fine-tuning script from the t5-trivia notebook. The script works if I run the code from IPython, but hangs if I run it from the command line.

In command-line mode (i.e. python t5test.py), no log is printed after the following output, whereas copy-pasting the same code into IPython runs it successfully.

2020-12-02 16:34:33.621945: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-12-02 16:34:33.621994: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
ERROR:root:Path not found: gs://t5-data/pretrained_models/small/operative_config.gin
2020-12-02 16:34:39.199438: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:373] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
/usr/local/lib/python3.6/dist-packages/t5/data/utils.py:197: UserWarning: Creating resources inside a function passed to Dataset.map() is not supported. Create each resource outside the function, and capture it inside the function to use it.
  return dataset.map(my_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
2020-12-02 16:34:44.751066: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-12-02 16:34:44.751119: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2020-12-02 16:34:44.751147: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (9b4ce1746994): /proc/driver/nvidia/version does not exist
WARNING:tensorflow:SimdMeshImpl ignoring devices ['', '', '', '', '', '', '', '']
WARNING:tensorflow:SimdMeshImpl ignoring devices ['', '', '', '', '', '', '', '']
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
WARNING:tensorflow:Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias  The initialzer will guess the input and output dimensions  based on dimension order.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1077: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py:1077: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:767: Variable.load (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:767: Variable.load (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.
2020-12-02 16:35:25.241792: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:373] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
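
Since the log above simply stops, one way to see where the process is stuck is to have Python dump every thread's stack after a timeout. The following is a minimal sketch using the standard-library `faulthandler` module; the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 300-second default are illustrative assumptions, not part of t5 or of the script in this issue.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=300, stream=None):
    """Dump all thread stacks to `stream` every `timeout_s` seconds.

    Hypothetical helper: call it at the top of the script, before the
    model is built, so a stall shows the Python frame it is blocked in.
    """
    stream = sys.stderr if stream is None else stream
    # Also dump tracebacks on fatal signals such as SIGSEGV or SIGABRT.
    faulthandler.enable(file=stream)
    # repeat=True keeps dumping at each interval until cancelled.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disarm_hang_watchdog():
    """Cancel the periodic dump once the slow section has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` as the first statement of `t5test.py` would print the stack of the blocked call to stderr if the script is still running after the timeout. The same effect is available without editing the script by running `python -X faulthandler t5test.py` and sending the process a fatal signal, although that only dumps on the signal, not on a timer.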

Here is the version info:

root@9b4ce1746994:/work# pip freeze | grep 'tensorflow'
mesh-tensorflow==0.1.17
tensorflow==2.3.1
tensorflow-datasets==4.1.0
tensorflow-estimator==2.3.0
tensorflow-metadata==0.25.0
tensorflow-text==2.3.0
root@9b4ce1746994:/work# pip freeze | grep 't5'
t5==0.7.1

craffel commented 3 years ago

Try passing the TPU name, zone, and project directly into the MtfModel initialization arguments rather than using a cluster resolver to get the address.

https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/mtf_model.py#L47
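
As a rough sketch of that suggestion, the TPU settings can be collected into keyword arguments and handed straight to `MtfModel`. The `tpu`, `tpu_zone`, and `gcp_project` argument names follow the constructor linked above; the environment variable names (`TPU_NAME`, `TPU_ZONE`, `GCP_PROJECT`) and both helper functions are assumptions made for illustration and are not defined by t5.

```python
import os

# Hypothetical environment variable names mapped to MtfModel arguments.
_ENV_TO_ARG = {
    "TPU_NAME": "tpu",
    "TPU_ZONE": "tpu_zone",
    "GCP_PROJECT": "gcp_project",
}


def tpu_kwargs_from_env(env=None):
    """Return the TPU keyword arguments, failing early if any are unset."""
    env = os.environ if env is None else env
    missing = [name for name in _ENV_TO_ARG if not env.get(name)]
    if missing:
        raise ValueError("Missing TPU settings: " + ", ".join(sorted(missing)))
    return {arg: env[name] for name, arg in _ENV_TO_ARG.items()}


def build_model(model_dir, env=None, **overrides):
    """Construct an MtfModel without going through a cluster resolver.

    The import is deferred so the helper above can be used (and tested)
    on a machine where t5 is not installed.
    """
    import t5.models

    kwargs = tpu_kwargs_from_env(env)
    # Remaining settings (batch_size, sequence_length, ...) pass through.
    kwargs.update(overrides)
    return t5.models.MtfModel(model_dir=model_dir, **kwargs)
```

With this, `build_model(MODEL_DIR, batch_size=..., sequence_length=...)` replaces the resolver-based address lookup from the notebook, and a missing setting raises immediately instead of leaving the script waiting on an unreachable TPU.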

jayendra13 commented 3 years ago

I have tried that but it doesn't make any difference.

craffel commented 3 years ago

Is your TPU in running status? Was it pre-empted or shut down? Are your gcloud credentials set up? If you are passing in those arguments correctly, there's not a lot I can do to debug - we use this codepath all the time on Cloud without issue.
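
The status question can be answered from the same machine before launching the script. The sketch below shells out to `gcloud compute tpus describe` and returns the node's reported state; the helper names are hypothetical, and the exact state strings (for example `READY` or `PREEMPTED`) are what I would expect from the Cloud TPU API rather than something confirmed in this thread.

```python
import shutil
import subprocess


def tpu_state_command(name, zone, project):
    """Build the gcloud invocation that prints only the TPU's state."""
    return [
        "gcloud", "compute", "tpus", "describe", name,
        "--zone", zone,
        "--project", project,
        "--format", "value(state)",
    ]


def tpu_state(name, zone, project, timeout_s=60):
    """Return the TPU state string reported by gcloud.

    Raises RuntimeError if gcloud is not installed, and
    subprocess.CalledProcessError if the call fails, which is also what
    happens when credentials are missing or the TPU does not exist.
    """
    if shutil.which("gcloud") is None:
        raise RuntimeError("gcloud CLI not found on PATH")
    result = subprocess.run(
        tpu_state_command(name, zone, project),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # text mode, compatible with Python 3.6
        timeout=timeout_s,
        check=True,
    )
    return result.stdout.strip()
```

If `tpu_state(...)` raises or returns anything other than a ready state, the hang is more likely an unreachable TPU than a problem in the fine-tuning code. `gcloud auth list` shows which account the credentials belong to.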