Closed by SarraCode 3 years ago
Hi, you need the --module_import="multilingual_t5.tasks" flag; see https://github.com/google-research/multilingual-t5#training
Hi, thank you for your response. I tried adding the flag, but I get this error. P.S.: I am running the command from the directory
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 941, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 994, in _gcd_import
File "<frozen importlib._bootstrap>", line 971, in _find_and_load
File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'multilingual_t5'
You need to clone this repo and run the command from the repo directory. See https://github.com/google-research/multilingual-t5#training
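For anyone hitting the same ModuleNotFoundError, a quick check of what Python can actually locate is sketched below. It uses only the standard library; `can_import` is a hypothetical helper name, not part of t5 or multilingual_t5. Since `python -m ...` puts the current directory at the front of the module search path, the sketch passes the current directory explicitly to mimic that.

```python
import importlib.util
import os
import sys


def can_import(module_name, search_dir=None):
    """Return True if `module_name` can be located, optionally searching `search_dir` first."""
    added = False
    if search_dir is not None and search_dir not in sys.path:
        sys.path.insert(0, search_dir)
        added = True
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package of a dotted name is missing.
        return False
    finally:
        # Leave sys.path as we found it.
        if added:
            sys.path.remove(search_dir)


if __name__ == "__main__":
    # Mimic `python -m ...`, which searches the current directory first.
    for name in ("t5", "multilingual_t5"):
        print(name, "importable:", can_import(name, os.getcwd()))
```

If `multilingual_t5` reports False while `t5` reports True, the command is probably not being run from the root of the cloned repo, i.e. the directory that contains the `multilingual_t5/` folder.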
That's what I am doing, but I got that error. Any ideas? I don't know what I am missing :/ Thank you in advance
I think there may be an error in the instructions!
Can you try:
python -m t5.models.mesh_transformer_main \
--module_import="multilingual_t5.tasks" \
--model_dir="${MODEL_DIR}" \
--gin_file="${PRETRAINED_DIR}/operative_config.gin" \
--gin_file="sequence_lengths/xquad.gin" \
--gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
--gin_param="tsv_dataset_fn.filename = '/t5-train/multilingual-t5/test.tsv'" \
--gin_param="utils.run.train_steps=$((PRETRAINED_STEPS+FINETUNE_STEPS))" \
--gin_param="utils.run.init_checkpoint='${PRETRAINED_DIR}/model.ckpt-${PRETRAINED_STEPS}'" \
--gin_param="utils.run.mesh_shape = 'model:1,batch:1'" \
--gin_param="utils.run.mesh_devices = ['gpu:1']" \
--gin_location_prefix="multilingual_t5/gin/"
Now I get
../.conda/envs/t5/bin/python: No module named t5.models.t5_mesh_transformer
I installed t5 using pip install
Have you already pip installed t5?
You may need to do python3 -m t5.models.mesh_transformer_main if you have python2.7 installed as python.
Also, don't forget the --module_import="multilingual_t5.tasks" line; I accidentally left it off before.
Thank you for all the details. I still get the same error. I have t5 installed via pip; this is the env I am running in:
pip list
Package Version
------------------------ ---------------------
absl-py 0.11.0
argcomplete 1.12.1
astunparse 1.6.3
attrs 20.2.0
Babel 2.8.0
boto 2.49.0
cachetools 4.1.1
certifi 2020.6.20
cffi 1.14.3
chardet 3.0.4
click 7.1.2
crcmod 1.7
cryptography 3.2.1
dataclasses 0.7
dill 0.3.3
dm-tree 0.1.5
fasteners 0.15
filelock 3.0.12
future 0.18.2
gast 0.3.3
gcs-oauth2-boto-plugin 2.7
gin-config 0.3.0
google-apitools 0.5.31
google-auth 1.23.0
google-auth-oauthlib 0.4.2
google-pasta 0.2.0
google-reauth 0.1.0
googleapis-common-protos 1.52.0
grpcio 1.33.2
gsutil 4.54
h5py 2.10.0
httplib2 0.18.1
idna 2.10
importlib-metadata 2.0.0
importlib-resources 3.3.0
joblib 0.17.0
Keras-Preprocessing 1.1.2
Markdown 3.3.3
mesh-tensorflow 0.1.17
mock 2.0.0
monotonic 1.5
nltk 3.5
numpy 1.19.4
oauth2client 4.1.3
oauthlib 3.1.0
opt-einsum 3.3.0
packaging 20.4
pandas 1.1.4
pbr 5.5.1
pip 20.2.4
portalocker 2.0.0
promise 2.3
protobuf 3.13.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.20
pyOpenSSL 19.1.0
pyparsing 2.4.7
python-dateutil 2.8.1
pytz 2020.4
pyu2f 0.1.5
regex 2020.10.28
requests 2.24.0
requests-oauthlib 1.3.0
retry-decorator 1.1.1
rouge-score 0.0.4
rsa 4.6
sacrebleu 1.4.14
sacremoses 0.0.43
scikit-learn 0.23.2
scipy 1.5.3
sentencepiece 0.1.94
setuptools 50.3.0.post20201103
six 1.15.0
t5 0.7.1
tensorboard 2.3.0
tensorboard-plugin-wit 1.7.0
tensorflow 2.3.1
tensorflow-datasets 4.0.1
tensorflow-estimator 2.3.0
tensorflow-metadata 0.25.0
tensorflow-text 2.3.0
termcolor 1.1.0
tfds-nightly 4.0.1.dev202011030854
threadpoolctl 2.1.0
tokenizers 0.9.2
torch 1.7.0
tqdm 4.51.0
transformers 3.4.0
typing-extensions 3.7.4.3
urllib3 1.25.11
Werkzeug 1.0.1
wheel 0.35.1
wrapt 1.12.1
zipp 3.4.0
And you're using python3? If you open python and call import t5, does that work?
Yes
python --version
Python 3.6.12 :: Anaconda, Inc.
and it works when I test import t5
It looks like you used t5.models.mesh_transformer instead of t5.models.mesh_transformer_main
My bad ^^" It works now. I get another error haha, I will check it :)
/multilingual-t5/multilingual_t5/tasks.py", line 42, in <module>
MC4_LANGS = tfds.text.c4.MC4_LANGUAGES
AttributeError: module 'tensorflow_datasets.text.c4' has no attribute 'MC4_LANGUAGES'
Sorry for the long issue; I still get an error while trying to run it on GPU. I have the model and the data on my local machine.
loading CUDA OK
2020-11-09 17:33:26.151373: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
INFO:tensorflow:model_type=bitransformer
I1109 17:33:31.661860 139718519514944 utils.py:2245] model_type=bitransformer
INFO:tensorflow:mode=train
I1109 17:33:31.662801 139718519514944 utils.py:2246] mode=train
INFO:tensorflow:sequence_length={'inputs': 1024, 'targets': 512}
I1109 17:33:31.663373 139718519514944 utils.py:2247] sequence_length={'inputs': 1024, 'targets': 512}
INFO:tensorflow:batch_size=1024
I1109 17:33:31.663927 139718519514944 utils.py:2248] batch_size=1024
INFO:tensorflow:train_steps=1020000
I1109 17:33:31.664472 139718519514944 utils.py:2249] train_steps=1020000
INFO:tensorflow:mesh_shape=model:1,batch:1
I1109 17:33:31.664912 139718519514944 utils.py:2250] mesh_shape=model:1,batch:1
INFO:tensorflow:layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
I1109 17:33:31.665416 139718519514944 utils.py:2251] layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
INFO:tensorflow:Building TPUConfig with tpu_job_name=None
I1109 17:33:31.665923 139718519514944 utils.py:2266] Building TPUConfig with tpu_job_name=None
2020-11-09 17:33:31.789120: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Failed precondition: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".
Traceback (most recent call last):
File "/home/user/.conda/envs/t5/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/user/.conda/envs/t5/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 264, in <module>
console_entry_point()
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 261, in console_entry_point
app.run(main)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 255, in main
model_dir=FLAGS.model_dir)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/config.py", line 1078, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
six.raise_from(proxy.with_traceback(exception.__traceback__), None)
File "<string>", line 3, in raise_from
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/config.py", line 1055, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 2292, in run
mesh_devices=mesh_devices)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1526, in get_estimator
input_vocab_size=inputs_vocabulary(vocabulary).vocab_size,
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/data/sentencepiece_vocabulary.py", line 106, in vocab_size
return self.tokenizer.GetPieceSize() + self._extra_ids # pylint:disable=unreachable
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/data/sentencepiece_vocabulary.py", line 91, in tokenizer
self._load_model()
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/data/sentencepiece_vocabulary.py", line 61, in _load_model
self._sp_model = f.read()
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 118, in read
length = self.size() - self.tell()
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 97, in size
return stat(self.__name).length
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 793, in stat
return stat_v2(filename)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 809, in stat_v2
return _pywrap_file_io.Stat(path)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Error executing an HTTP request: libcurl code 77 meaning 'Problem with the SSL CA cert (path? access rights?)', error details: error setting certificate verify locations:
CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: none
when reading metadata of gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model
In call to configurable 'run' (<function run at 0x7f124c0fa048>)
I think something like this would work:
--gin_param="tsv_dataset_fn.vocabulary = SentencePieceVocabulary()"
--gin_param="SentencePieceVocabulary.sentencepiece_model_file = '/path/to/spm'"
Hello, I am wondering if it's possible to fine-tune T5 on GPUs. I am struggling to make it work on 4x RTX 6000: I get an OOM with a long log that ends in this final error. Any help is appreciated. Thank you
2020-11-13 15:43:23.277335: I tensorflow/core/common_runtime/bfc_allocator.cc:1034] 4 Chunks of size 256114688 totalling 977.00MiB
2020-11-13 15:43:23.277560: I tensorflow/core/common_runtime/bfc_allocator.cc:1034] 9 Chunks of size 512229376 totalling 4.29GiB
2020-11-13 15:43:23.277759: I tensorflow/core/common_runtime/bfc_allocator.cc:1034] 1 Chunks of size 1024458752 totalling 977.00MiB
2020-11-13 15:43:23.278035: I tensorflow/core/common_runtime/bfc_allocator.cc:1038] Sum Total of in-use chunks: 20.47GiB
2020-11-13 15:43:23.278291: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] total_region_allocated_bytes_: 22002673920 memory_limit_: 22002673920 available bytes: 0 curr_region_allocation_bytes_: 44005347840
2020-11-13 15:43:23.278493: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Stats:
Limit: 22002673920
InUse: 21983674112
MaxInUse: 21996894720
NumAllocs: 7297
MaxAllocSize: 1024458752
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2020-11-13 15:43:23.278953: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ****************************************************************************************************
2020-11-13 15:43:23.279251: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[1,2,1024,512] and type float on /job:localhost/replica:0/task:0/device:GPU:3 by allocator GPU_3_bfc
INFO:tensorflow:training_loop marked as finished
I1113 15:43:24.458347 140095770998592 error_handling.py:115] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W1113 15:43:24.460635 140095770998592 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
File "/home/user/.conda/envs/t5/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/user/.conda/envs/t5/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 264, in <module>
console_entry_point()
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 261, in console_entry_point
app.run(main)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 255, in main
model_dir=FLAGS.model_dir)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/config.py", line 1078, in gin_wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
six.raise_from(proxy.with_traceback(exception.__traceback__), None)
File "<string>", line 3, in raise_from
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/config.py", line 1055, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 2302, in run
skip_seen_data=skip_seen_data)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1615, in train_model
estimator.train(input_fn=input_fn, max_steps=train_steps)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3089, in train
rendezvous.raise_errors()
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
six.reraise(typ, value, traceback)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3084, in train
saving_listeners=saving_listeners)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1208, in _train_model_default
saving_listeners)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1511, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 778, in run
run_metadata=run_metadata)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1283, in run
run_metadata=run_metadata)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1384, in run
raise six.reraise(*original_exc_info)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1369, in run
return self._sess.run(*args, **kwargs)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1442, in run
run_metadata=run_metadata)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1200, in run
return self._sess.run(*args, **kwargs)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 958, in run
run_metadata_ptr)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1181, in _run
feed_dict_tensor, options, run_metadata)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 5 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1,2,512,125056] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot (defined at /site-packages/mesh_tensorflow/ops.py:3846) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[while_loop/while/Exit_4/_3935]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[1,2,512,125056] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot (defined at /site-packages/mesh_tensorflow/ops.py:3846) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[while_loop/while/decoder/block_005/layer_002/dropout/einsum/parallel_1/split/_13695]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(2) Resource exhausted: OOM when allocating tensor with shape[1,2,512,125056] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot (defined at /site-packages/mesh_tensorflow/ops.py:3846) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(3) Resource exhausted: OOM when allocating tensor with shape[1,2,512,125056] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot (defined at /site-packages/mesh_tensorflow/ops.py:3846) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[while_loop/while/decoder/block_006/layer_000/SelfAttention/einsum_5/gradients/einsum/parallel_1_2/split/_14145]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(4) Resource exhausted: OOM when allocating tensor with shape[1,2,512,125056] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
[[node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot (defined at /site-packages/mesh_tensorflow/ops.py:3846) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[while_loop/while/decoder/block_003/layer_001/dropout/einsum/parallel_0_2/split/_13577]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot:
while_loop/while/decoder/one_hot_1/parallel_1_2/Cast/x (defined at /site-packages/mesh_tensorflow/ops.py:3844)
while_loop/while/decoder/one_hot_1/parallel_1_2/sub (defined at /site-packages/mesh_tensorflow/ops.py:3842)
while_loop/while/decoder/one_hot_1/parallel_1_2/Cast_1/x (defined at /site-packages/mesh_tensorflow/ops.py:3845)
Input Source operations connected to node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot:
while_loop/while/decoder/one_hot_1/parallel_1_2/Cast/x (defined at /site-packages/mesh_tensorflow/ops.py:3844)
while_loop/while/decoder/one_hot_1/parallel_1_2/sub (defined at /site-packages/mesh_tensorflow/ops.py:3842)
while_loop/while/decoder/one_hot_1/parallel_1_2/Cast_1/x (defined at /site-packages/mesh_tensorflow/ops.py:3845)
Input Source operations connected to node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot:
while_loop/while/decoder/one_hot_1/parallel_1_2/Cast/x (defined at /site-packages/mesh_tensorflow/ops.py:3844)
while_loop/while/decoder/one_hot_1/parallel_1_2/sub (defined at /site-packages/mesh_tensorflow/ops.py:3842)
while_loop/while/decoder/one_hot_1/parallel_1_2/Cast_1/x (defined at /site-packages/mesh_tensorflow/ops.py:3845)
Input Source operations connected to node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot:
while_loop/while/decoder/one_hot_1/parallel_1_2/Cast/x (defined at /site-packages/mesh_tensorflow/ops.py:3844)
while_loop/while/decoder/one_hot_1/parallel_1_2/sub (defined at /site-packages/mesh_tensorflow/ops.py:3842)
while_loop/while/decoder/one_hot_1/parallel_1_2/Cast_1/x (defined at /site-packages/mesh_tensorflow/ops.py:3845)
Input Source operations connected to node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot:
while_loop/while/decoder/one_hot_1/parallel_1_2/Cast/x (defined at /site-packages/mesh_tensorflow/ops.py:3844)
while_loop/while/decoder/one_hot_1/parallel_1_2/sub (defined at /site-packages/mesh_tensorflow/ops.py:3842)
while_loop/while/decoder/one_hot_1/parallel_1_2/Cast_1/x (defined at /site-packages/mesh_tensorflow/ops.py:3845)
Original stack trace for 'while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot':
File "/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/site-packages/t5/models/mesh_transformer_main.py", line 264, in <module>
console_entry_point()
File "/site-packages/t5/models/mesh_transformer_main.py", line 261, in console_entry_point
app.run(main)
File "/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/site-packages/t5/models/mesh_transformer_main.py", line 255, in main
model_dir=FLAGS.model_dir)
File "/site-packages/gin/config.py", line 1055, in gin_wrapper
return fn(*new_args, **new_kwargs)
File "/site-packages/mesh_tensorflow/transformer/utils.py", line 2302, in run
skip_seen_data=skip_seen_data)
File "/site-packages/mesh_tensorflow/transformer/utils.py", line 1615, in train_model
estimator.train(input_fn=input_fn, max_steps=train_steps)
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3084, in train
saving_listeners=saving_listeners)
File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
self.config)
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2921, in _call_model_fn
config)
File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3179, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1700, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2031, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "/site-packages/mesh_tensorflow/transformer/utils.py", line 686, in my_model_fn
log_file=model_info_file)
File "/site-packages/mesh_tensorflow/ops.py", line 726, in __init__
op.lower(self)
File "/site-packages/mesh_tensorflow/ops.py", line 6092, in lower
**self._tf_kwargs)
File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2774, in while_loop
return_same_structure)
File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2256, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2181, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/site-packages/mesh_tensorflow/ops.py", line 6064, in tf_body_fn
op.lower(lowering)
File "/site-packages/mesh_tensorflow/ops.py", line 3848, in lower
slicewise_fn, lowering.tensors[indices], offset)
File "/site-packages/mesh_tensorflow/placement_mesh_impl.py", line 173, in slicewise
ret = mtf.parallel(self.devices, fn, *inputs)
File "/site-packages/mesh_tensorflow/ops.py", line 5666, in parallel
ret.append(fn(*my_args, **my_kwargs))
File "/site-packages/mesh_tensorflow/ops.py", line 3846, in slicewise_fn
dtype=self._dtype)
File "/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/site-packages/tensorflow/python/ops/array_ops.py", line 4123, in one_hot
name)
File "/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6327, in one_hot
off_value=off_value, axis=axis, name=name)
File "/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/site-packages/tensorflow/python/framework/ops.py", line 3485, in _create_op_internal
op_def=op_def)
File "/site-packages/tensorflow/python/framework/ops.py", line 1949, in __init__
self._traceback = tf_stack.extract_stack()
In call to configurable 'run' (<function run at 0x7f69b5e15048>)
I am using --gin_param="utils.run.mesh_shape = 'model:1,batch:2'"
@adarob
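A rough way to read the OOM log above is to work out how large the failing tensor is. The sketch below is only arithmetic on the shapes printed in the log; `tensor_bytes` is a hypothetical helper, and it assumes that TensorFlow's "type float" means 4-byte float32 elements.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; float32 (assumed here) uses 4 bytes per element."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element


# Shape taken from the "OOM when allocating tensor" lines in the log.
one_hot = tensor_bytes((1, 2, 512, 125056))
print(one_hot)               # 512229376, the chunk size reported by the allocator
print(9 * one_hot / 2**30)   # roughly 4.29 GiB, matching the "9 Chunks" line
```

Each of these one_hot tensors is therefore close to half a GiB, and nine chunks of that size were already in use on a device limited to 22002673920 bytes. The size scales linearly with every dimension of the shape, so, assuming the dimensions correspond to per-device batch, target length, and a slice of the vocabulary, shrinking the batch or the target sequence length should shrink this allocation proportionally.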
Hi! I seem to encounter the same error as above, even after trying all the above solutions.
The error:
ValueError: No Task or Mixture found with name: xnli_zeroshot
In call to configurable 'get_vocabulary' (<function get_vocabulary at 0x7fe995422598>)
My Python and t5 versions:
Python 3.6.13
t5 0.7.1
My script (run under the cloned multilingual-t5 path, commit id a0a42fc):
python -m t5.models.mesh_transformer_main \
--module_import="multilingual_t5.tasks" \
--tpu="${TPU}" \
--gcp_project="${PROJECT_NAME}" \
--tpu_zone="${TPU_ZONE}" \
--model_dir="${GS_FOLDER}" \
--gin_file="gs://t5-data/pretrained_models/mt5/base/operative_config.gin" \
--gin_param="init_checkpoint = 'gs://t5-data/pretrained_models/mt5/base/model.ckpt-1000000'" \
--gin_param="run.train_steps = 1100000" \
--gin_param="run.save_checkpoints_steps = 10000" \
--gin_param="utils.run.batch_size=('tokens_per_batch', 65536)" \
--gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
--gin_param="tsv_dataset_fn.filename = '${GS_FOLDER}/train.tsv' " \
--gin_location_prefix="multilingual_t5/gin/"
Any idea what I missed here?
Hello, I closed the other issue by mistake, so I will post my new error here. I am trying to fine-tune mT5 on my dataset, but I couldn't make it work. I am installing T5 via pip install . Any help is appreciated.
My error
This is the command I use for fine-tuning