Describe the bug
I am trying to fine-tune the mT5 model on a custom dataset on a TPU on GCP. I am carefully following the process described in this repository; however, I get a TensorFlow-related error.
2022-07-13 22:29:42.556669: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-07-13 22:29:42.556729: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "/home/.local/bin/t5_mesh_transformer", line 5, in <module>
from t5.models.mesh_transformer_main import console_entry_point
File "/home/.local/lib/python3.9/site-packages/t5/__init__.py", line 17, in <module>
import t5.data
File "/home/.local/lib/python3.9/site-packages/t5/data/__init__.py", line 17, in <module>
from t5.data.dataset_providers import *
File "/home/.local/lib/python3.9/site-packages/t5/data/dataset_providers.py", line 28, in <module>
import seqio
File "/home/.local/lib/python3.9/site-packages/seqio/__init__.py", line 18, in <module>
from seqio.dataset_providers import *
File "/home/.local/lib/python3.9/site-packages/seqio/dataset_providers.py", line 34, in <module>
from seqio import utils
File "/home/.local/lib/python3.9/site-packages/seqio/utils.py", line 25, in <module>
import tensorflow.compat.v2 as tf
File "/home/.local/lib/python3.9/site-packages/tensorflow/__init__.py", line 37, in <module>
from tensorflow.python.tools import module_util as _module_util
File "/home/.local/lib/python3.9/site-packages/tensorflow/python/__init__.py", line 42, in <module>
from tensorflow.python import data
File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/__init__.py", line 21, in <module>
from tensorflow.python.data import experimental
File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/experimental/__init__.py", line 95, in <module>
from tensorflow.python.data.experimental import service
File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/experimental/service/__init__.py", line 387, in <module>
from tensorflow.python.data.experimental.ops.data_service_ops import distribute
File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/experimental/ops/data_service_ops.py", line 26, in <module>
from tensorflow.python.data.ops import dataset_ops
File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 31, in <module>
from tensorflow.python.data.ops import iterator_ops
File "/home/.local/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 36, in <module>
from tensorflow.python.training.saver import BaseSaverBuilder
File "/home/.local/lib/python3.9/site-packages/tensorflow/python/training/saver.py", line 51, in <module>
from tensorflow.python.training.saving import saveable_object_util
File "/home/.local/lib/python3.9/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 20, in <module>
from tensorflow.python.eager import def_function
File "/home/.local/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 75, in <module>
from tensorflow.python.eager import function as function_lib
File "/home/.local/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 35, in <module>
from tensorflow.python.eager import backprop
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 786, in exec_module
File "<frozen importlib._bootstrap_external>", line 918, in get_code
File "<frozen importlib._bootstrap_external>", line 587, in _compile_bytecode
EOFError: marshal data too short
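A note on the last frame: `_compile_bytecode` raises `EOFError: marshal data too short` when a cached `.pyc` file ends before its code object is complete, which usually points to a truncated bytecode cache (for example after an interrupted install or a full disk) rather than to a bug in the TensorFlow source. The following is a minimal diagnostic sketch under that assumption; it has not been run on the VM above. The helper name `find_truncated_pyc` is hypothetical, and the 16-byte header size assumes CPython 3.7 or later.

```python
import importlib.util
import marshal
import pathlib
import sys


def find_truncated_pyc(root):
    """Return the .pyc files under root whose code object cannot be unmarshalled."""
    bad = []
    for pyc in sorted(pathlib.Path(root).rglob("*.pyc")):
        data = pyc.read_bytes()
        # CPython 3.7+ writes a 16-byte header before the marshalled code object.
        if len(data) < 16:
            bad.append(pyc)
            continue
        # Files written by a different interpreter version are skipped, not flagged.
        if data[:4] != importlib.util.MAGIC_NUMBER:
            continue
        try:
            marshal.loads(data[16:])
        except (EOFError, ValueError, TypeError):
            # Truncated or corrupt payload, the same failure the traceback ends in.
            bad.append(pyc)
    return bad


if __name__ == "__main__":
    # Each command-line argument is a directory to scan, e.g. a site-packages path.
    for root in sys.argv[1:]:
        for path in find_truncated_pyc(root):
            print(path)
```

Run it with the interpreter that produces the error, passing the `site-packages` directory from the traceback as the argument. It only prints paths and deletes nothing. If it lists files, removing those `.pyc` files (Python regenerates them on the next import) or reinstalling the affected package would be the usual next step.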
To Reproduce
Steps to reproduce the behavior:
Create a VM
Create a TPU
Create a bucket and upload the .txt corpus on which I will train the model
Expected behaviour
The training on the TPU should start.
Any help would be appreciated.
Thank you