google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

TPU VM Training Error (EOFError: marshal data too short) #1037

Open nadhem-zmandar opened 2 years ago

nadhem-zmandar commented 2 years ago

Describe the bug: I am trying to fine-tune the mT5 model on a custom dataset on a TPU on GCP. I am carefully following the process described in this repository; however, I get a TensorFlow-related error.

2022-07-13 22:29:42.556669: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-13 22:29:42.556729: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "/home/.local/bin/t5_mesh_transformer", line 5, in <module>
    from t5.models.mesh_transformer_main import console_entry_point
  File "/home/.local/lib/python3.9/site-packages/t5/__init__.py", line 17, in <module>
    import t5.data
  File "/home/.local/lib/python3.9/site-packages/t5/data/__init__.py", line 17, in <module>
    from t5.data.dataset_providers import *
  File "/home/.local/lib/python3.9/site-packages/t5/data/dataset_providers.py", line 28, in <module>
    import seqio
  File "/home/.local/lib/python3.9/site-packages/seqio/__init__.py", line 18, in <module>
    from seqio.dataset_providers import *
  File "/home/.local/lib/python3.9/site-packages/seqio/dataset_providers.py", line 34, in <module>
    from seqio import utils
  File "/home/.local/lib/python3.9/site-packages/seqio/utils.py", line 25, in <module>
    import tensorflow.compat.v2 as tf
  File "/home/.local/lib/python3.9/site-packages/tensorflow/__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/__init__.py", line 42, in <module>
    from tensorflow.python import data
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/__init__.py", line 21, in <module>
    from tensorflow.python.data import experimental
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/experimental/__init__.py", line 95, in <module>
    from tensorflow.python.data.experimental import service
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/experimental/service/__init__.py", line 387, in <module>
    from tensorflow.python.data.experimental.ops.data_service_ops import distribute
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/experimental/ops/data_service_ops.py", line 26, in <module>
    from tensorflow.python.data.ops import dataset_ops
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 31, in <module>
    from tensorflow.python.data.ops import iterator_ops
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 36, in <module>
    from tensorflow.python.training.saver import BaseSaverBuilder
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 51, in <module>
    from tensorflow.python.training.saving import saveable_object_util
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 20, in <module>
    from tensorflow.python.eager import def_function
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 75, in <module>
    from tensorflow.python.eager import function as function_lib
  File "/home/.local/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 35, in <module>
    from tensorflow.python.eager import backprop
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 786, in exec_module
  File "<frozen importlib._bootstrap_external>", line 918, in get_code
  File "<frozen importlib._bootstrap_external>", line 587, in _compile_bytecode
EOFError: marshal data too short

To Reproduce
Steps to reproduce the behavior:

  1. Create a VM
  2. Create a TPU
  3. Create a bucket and upload the .txt corpus on which I will train the model
  4. Install t5[gcp]: pip install t5[gcp]
  5. Set the Env variables following
  6. Run the fine-tuning script:
    t5_mesh_transformer \
    --tpu="${TPU_NAME}" \
    --gcp_project="${PROJECT}" \
    --tpu_zone="${ZONE}" \
    --model_dir="${MODEL_DIR}" \
    --t5_tfds_data_dir="${DATA_DIR}" \
    --gin_file="dataset.gin" \
    --gin_param="utils.tpu_mesh_shape.tpu_topology = '${TPU_SIZE}'" \
    --gin_param="MIXTURE_NAME = 'glue_mrpc_v002'" \
    --gin_param="run.train_steps = 1010000" \
    --gin_file="learning_rate_schedules/constant_0_001.gin" \
    --gin_param="tokens_per_batch=512" \
    --gin_file="gs://t5-data/pretrained_models/small/operative_config.gin"
Expected behavior: Training on the TPU should start.

Any help would be appreciated.

Thank you

anas-zafar commented 1 year ago

Hi @nadhem-zmandar , were you able to resolve this?
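
For anyone hitting the same traceback: `EOFError: marshal data too short` is raised while Python loads a cached bytecode (`.pyc`) file, and it usually points to a truncated or partially written cache file (for example after an interrupted install or a full disk), not to a bug in t5 or TensorFlow. This has not been confirmed as the cause in this thread. The sketch below is one possible way to look for such files. It assumes Python 3.7 or later (a 16-byte `.pyc` header), and the helper names `is_corrupt_pyc` and `find_corrupt_pycs` are hypothetical, not part of any library.

```python
import importlib.util
import marshal
from pathlib import Path

# Assumption: Python 3.7+ writes a 16-byte .pyc header (PEP 552).
HEADER_SIZE = 16


def is_corrupt_pyc(path):
    """Return True if the .pyc file at `path` cannot be unmarshalled."""
    data = Path(path).read_bytes()
    if len(data) < HEADER_SIZE:
        # Too short to even hold a header, so it is certainly truncated.
        return True
    if data[:4] != importlib.util.MAGIC_NUMBER:
        # Written by a different interpreter version; we cannot judge it here.
        return False
    try:
        marshal.loads(data[HEADER_SIZE:])
    except (EOFError, ValueError, TypeError):
        return True
    return False


def find_corrupt_pycs(root):
    """Recursively list the .pyc files under `root` that fail to load."""
    return sorted(p for p in Path(root).rglob("*.pyc") if is_corrupt_pyc(p))
```

Calling `find_corrupt_pycs("/home/.local/lib/python3.9/site-packages")` (the directory shown in the traceback) with the same interpreter that runs `t5_mesh_transformer` should list any damaged cache files. Deleting them lets Python regenerate the bytecode on the next import; if the underlying `.py` files are damaged too, reinstalling the package with `pip install --force-reinstall --no-cache-dir tensorflow` may be needed.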