google-research / multilingual-t5


can't reproduce finetuning #19

Closed SarraCode closed 3 years ago

SarraCode commented 4 years ago

Hello, I closed the other issue by mistake, so I will post my new error here. I am trying to fine-tune mT5 on my own dataset but I couldn't make it work. I am installing T5 via pip install . Any help is appreciated.

My error

2020-11-04 17:32:58.543624: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
....;
    task_or_mixture_name)
ValueError: No Task or Mixture found with name: xnli_zeroshot
  In call to configurable 'get_vocabulary' (<function get_vocabulary at 0x7f57c77d3b70>)

This is the command I use for fine-tuning:

PRETRAINED_DIR="/mt5/small"
PRETRAINED_STEPS=1000000
FINETUNE_STEPS=20000
MODEL_DIR="//models"

t5_mesh_transformer \
  --model_dir="${MODEL_DIR}" \
  --gin_file="${PRETRAINED_DIR}/operative_config.gin" \
  --gin_file="sequence_lengths/xquad.gin" \
  --gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
  --gin_param="tsv_dataset_fn.filename = '/t5-train/multilingual-t5/test.tsv'" \
  --gin_param="utils.run.train_steps=$((PRETRAINED_STEPS+FINETUNE_STEPS))" \
  --gin_param="utils.run.init_checkpoint='${PRETRAINED_DIR}/model.ckpt-${PRETRAINED_STEPS}'" \
  --gin_param="utils.run.mesh_shape = 'model:1,batch:1'" \
  --gin_param="utils.run.mesh_devices = ['gpu:1']" \
  --gin_location_prefix="multilingual_t5/gin/"

craffel commented 3 years ago

Hi, you need the --module_import="multilingual_t5.tasks" flag, see https://github.com/google-research/multilingual-t5#training

SarraCode commented 3 years ago

Hi, thank you for your response. I tried adding the flag, but I get the error below. P.S. I am running the command from the directory.

return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 941, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'multilingual_t5'
craffel commented 3 years ago

You need to clone this repo and run the command from the repo directory. See https://github.com/google-research/multilingual-t5#training
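
A quick way to check whether Python can see the package from a given directory is sketched below. It assumes only the standard library; module_visible is a hypothetical helper name, and the check says nothing about whether the package imports cleanly, only whether it can be located.

```python
import importlib
import importlib.util
import os
import sys


def module_visible(name, extra_path=None):
    """Return True if `name` can be located, optionally with a directory
    temporarily prepended to sys.path (hypothetical helper for diagnosis)."""
    added = False
    if extra_path is not None and extra_path not in sys.path:
        sys.path.insert(0, extra_path)
        added = True
    try:
        # Drop cached directory listings so newly added paths are rescanned.
        importlib.invalidate_caches()
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises if a parent package is missing or malformed.
        return False
    finally:
        if added:
            sys.path.remove(extra_path)


if __name__ == "__main__":
    # Run from the directory you launch training from.
    print("visible as-is:", module_visible("multilingual_t5"))
    print("visible from cwd:", module_visible("multilingual_t5", os.getcwd()))
```

If the first line prints False and the second prints True, the package is present in the current directory but that directory is not on the import path of the process being launched.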

SarraCode commented 3 years ago

That's what I am doing, but I still get that error. Any ideas? I don't know what I am missing :/ Thank you in advance.

adarob commented 3 years ago

I think there may be an error in the instructions!

Can you try:

python -m t5.models.mesh_transformer_main \
  --module_import="multilingual_t5.tasks" \
  --model_dir="${MODEL_DIR}" \
  --gin_file="${PRETRAINED_DIR}/operative_config.gin" \
  --gin_file="sequence_lengths/xquad.gin" \
  --gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
  --gin_param="tsv_dataset_fn.filename = '/t5-train/multilingual-t5/test.tsv'" \
  --gin_param="utils.run.train_steps=$((PRETRAINED_STEPS+FINETUNE_STEPS))" \
  --gin_param="utils.run.init_checkpoint='${PRETRAINED_DIR}/model.ckpt-${PRETRAINED_STEPS}'" \
  --gin_param="utils.run.mesh_shape = 'model:1,batch:1'" \
  --gin_param="utils.run.mesh_devices = ['gpu:1']" \
  --gin_location_prefix="multilingual_t5/gin/"

SarraCode commented 3 years ago

Now I get:

../.conda/envs/t5/bin/python: No module named t5.models.t5_mesh_transformer

I installed t5 using pip install.

adarob commented 3 years ago

Have you already pip installed t5?

adarob commented 3 years ago

You may need to do python3 -m t5.models.mesh_transformer_main if you have python2.7 installed as python.

adarob commented 3 years ago

Also, don't forget the --module_import="multilingual_t5.tasks" line -- I accidentally left it off before.

SarraCode commented 3 years ago

Thank you for all the details. I still get the same error. I have t5 installed via pip; this is the environment I am running in:

pip list
Package                  Version
------------------------ ---------------------
absl-py                  0.11.0
argcomplete              1.12.1
astunparse               1.6.3
attrs                    20.2.0
Babel                    2.8.0
boto                     2.49.0
cachetools               4.1.1
certifi                  2020.6.20
cffi                     1.14.3
chardet                  3.0.4
click                    7.1.2
crcmod                   1.7
cryptography             3.2.1
dataclasses              0.7
dill                     0.3.3
dm-tree                  0.1.5
fasteners                0.15
filelock                 3.0.12
future                   0.18.2
gast                     0.3.3
gcs-oauth2-boto-plugin   2.7
gin-config               0.3.0
google-apitools          0.5.31
google-auth              1.23.0
google-auth-oauthlib     0.4.2
google-pasta             0.2.0
google-reauth            0.1.0
googleapis-common-protos 1.52.0
grpcio                   1.33.2
gsutil                   4.54
h5py                     2.10.0
httplib2                 0.18.1
idna                     2.10
importlib-metadata       2.0.0
importlib-resources      3.3.0
joblib                   0.17.0
Keras-Preprocessing      1.1.2
Markdown                 3.3.3
mesh-tensorflow          0.1.17
mock                     2.0.0
monotonic                1.5
nltk                     3.5
numpy                    1.19.4
oauth2client             4.1.3
oauthlib                 3.1.0
opt-einsum               3.3.0
packaging                20.4
pandas                   1.1.4
pbr                      5.5.1
pip                      20.2.4
portalocker              2.0.0
promise                  2.3
protobuf                 3.13.0
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pycparser                2.20
pyOpenSSL                19.1.0
pyparsing                2.4.7
python-dateutil          2.8.1
pytz                     2020.4
pyu2f                    0.1.5
regex                    2020.10.28
requests                 2.24.0
requests-oauthlib        1.3.0
retry-decorator          1.1.1
rouge-score              0.0.4
rsa                      4.6
sacrebleu                1.4.14
sacremoses               0.0.43
scikit-learn             0.23.2
scipy                    1.5.3
sentencepiece            0.1.94
setuptools               50.3.0.post20201103
six                      1.15.0
t5                       0.7.1
tensorboard              2.3.0
tensorboard-plugin-wit   1.7.0
tensorflow               2.3.1
tensorflow-datasets      4.0.1
tensorflow-estimator     2.3.0
tensorflow-metadata      0.25.0
tensorflow-text          2.3.0
termcolor                1.1.0
tfds-nightly             4.0.1.dev202011030854
threadpoolctl            2.1.0
tokenizers               0.9.2
torch                    1.7.0
tqdm                     4.51.0
transformers             3.4.0
typing-extensions        3.7.4.3
urllib3                  1.25.11
Werkzeug                 1.0.1
wheel                    0.35.1
wrapt                    1.12.1
zipp                     3.4.0
adarob commented 3 years ago

And you're using python3? If you open python and call import t5 does that work?

SarraCode commented 3 years ago

Yes

python --version
Python 3.6.12 :: Anaconda, Inc.

and it works when I test import t5

adarob commented 3 years ago

It looks like you used t5.models.mesh_transformer instead of t5.models.mesh_transformer_main

SarraCode commented 3 years ago

My bad ^^" It works now. I get another error haha, I will check it :)

/multilingual-t5/multilingual_t5/tasks.py", line 42, in <module>
    MC4_LANGS = tfds.text.c4.MC4_LANGUAGES
AttributeError: module 'tensorflow_datasets.text.c4' has no attribute 'MC4_LANGUAGES'
SarraCode commented 3 years ago

Sorry for the long issue. I still get an error when trying to run it on a GPU. I have the model and the data on my local machine.

loading CUDA OK
2020-11-09 17:33:26.151373: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
INFO:tensorflow:model_type=bitransformer
I1109 17:33:31.661860 139718519514944 utils.py:2245] model_type=bitransformer
INFO:tensorflow:mode=train
I1109 17:33:31.662801 139718519514944 utils.py:2246] mode=train
INFO:tensorflow:sequence_length={'inputs': 1024, 'targets': 512}
I1109 17:33:31.663373 139718519514944 utils.py:2247] sequence_length={'inputs': 1024, 'targets': 512}
INFO:tensorflow:batch_size=1024
I1109 17:33:31.663927 139718519514944 utils.py:2248] batch_size=1024
INFO:tensorflow:train_steps=1020000
I1109 17:33:31.664472 139718519514944 utils.py:2249] train_steps=1020000
INFO:tensorflow:mesh_shape=model:1,batch:1
I1109 17:33:31.664912 139718519514944 utils.py:2250] mesh_shape=model:1,batch:1
INFO:tensorflow:layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
I1109 17:33:31.665416 139718519514944 utils.py:2251] layout_rules=ensemble:ensemble,batch:batch,d_ff:model,heads:model,vocab:model,experts:batch
INFO:tensorflow:Building TPUConfig with tpu_job_name=None
I1109 17:33:31.665923 139718519514944 utils.py:2266] Building TPUConfig with tpu_job_name=None
2020-11-09 17:33:31.789120: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Failed precondition: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Couldn't resolve host 'metadata'".
Traceback (most recent call last):
  File "/home/user/.conda/envs/t5/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/.conda/envs/t5/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 264, in <module>
    console_entry_point()
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 261, in console_entry_point
    app.run(main)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 255, in main
    model_dir=FLAGS.model_dir)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/config.py", line 1078, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 2292, in run
    mesh_devices=mesh_devices)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1526, in get_estimator
    input_vocab_size=inputs_vocabulary(vocabulary).vocab_size,
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/data/sentencepiece_vocabulary.py", line 106, in vocab_size
    return self.tokenizer.GetPieceSize() + self._extra_ids  # pylint:disable=unreachable
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/data/sentencepiece_vocabulary.py", line 91, in tokenizer
    self._load_model()
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/data/sentencepiece_vocabulary.py", line 61, in _load_model
    self._sp_model = f.read()
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 118, in read
    length = self.size() - self.tell()
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 97, in size
    return stat(self.__name).length
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 793, in stat
    return stat_v2(filename)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 809, in stat_v2
    return _pywrap_file_io.Stat(path)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Error executing an HTTP request: libcurl code 77 meaning 'Problem with the SSL CA cert (path? access rights?)', error details: error setting certificate verify locations:
  CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
     when reading metadata of gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model
  In call to configurable 'run' (<function run at 0x7f124c0fa048>)
adarob commented 3 years ago

I think something like this would work:

--gin_param="tsv_dataset_fn.vocabulary = @SentencePieceVocabulary()" \
--gin_param="SentencePieceVocabulary.sentencepiece_model_file = '/path/to/spm'"

SarraCode commented 3 years ago

Hello, I am wondering whether it is possible to fine-tune T5 on GPUs. I am struggling to make it work on 4x RTX 6000: I get an OOM with a long log that ends in the error below. Any help is appreciated. Thank you.


2020-11-13 15:43:23.277335: I tensorflow/core/common_runtime/bfc_allocator.cc:1034] 4 Chunks of size 256114688 totalling 977.00MiB
2020-11-13 15:43:23.277560: I tensorflow/core/common_runtime/bfc_allocator.cc:1034] 9 Chunks of size 512229376 totalling 4.29GiB
2020-11-13 15:43:23.277759: I tensorflow/core/common_runtime/bfc_allocator.cc:1034] 1 Chunks of size 1024458752 totalling 977.00MiB
2020-11-13 15:43:23.278035: I tensorflow/core/common_runtime/bfc_allocator.cc:1038] Sum Total of in-use chunks: 20.47GiB
2020-11-13 15:43:23.278291: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] total_region_allocated_bytes_: 22002673920 memory_limit_: 22002673920 available bytes: 0 curr_region_allocation_bytes_: 44005347840
2020-11-13 15:43:23.278493: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Stats:
Limit:                     22002673920
InUse:                     21983674112
MaxInUse:                  21996894720
NumAllocs:                        7297
MaxAllocSize:               1024458752
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2020-11-13 15:43:23.278953: W tensorflow/core/common_runtime/bfc_allocator.cc:439] ****************************************************************************************************
2020-11-13 15:43:23.279251: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cwise_ops_common.cc:82 : Resource exhausted: OOM when allocating tensor with shape[1,2,1024,512] and type float on /job:localhost/replica:0/task:0/device:GPU:3 by allocator GPU_3_bfc
INFO:tensorflow:training_loop marked as finished
I1113 15:43:24.458347 140095770998592 error_handling.py:115] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W1113 15:43:24.460635 140095770998592 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
  File "/home/user/.conda/envs/t5/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/.conda/envs/t5/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 264, in <module>
    console_entry_point()
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 261, in console_entry_point
    app.run(main)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/t5/models/mesh_transformer_main.py", line 255, in main
    model_dir=FLAGS.model_dir)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/config.py", line 1078, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 2302, in run
    skip_seen_data=skip_seen_data)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/mesh_tensorflow/transformer/utils.py", line 1615, in train_model
    estimator.train(input_fn=input_fn, max_steps=train_steps)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3089, in train
    rendezvous.raise_errors()
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3084, in train
    saving_listeners=saving_listeners)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1208, in _train_model_default
    saving_listeners)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1511, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 778, in run
    run_metadata=run_metadata)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1283, in run
    run_metadata=run_metadata)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1384, in run
    raise six.reraise(*original_exc_info)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1369, in run
    return self._sess.run(*args, **kwargs)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1442, in run
    run_metadata=run_metadata)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1200, in run
    return self._sess.run(*args, **kwargs)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 958, in run
    run_metadata_ptr)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1181, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/user/.conda/envs/t5/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 5 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,2,512,125056] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
     [[node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot (defined at /site-packages/mesh_tensorflow/ops.py:3846) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[while_loop/while/Exit_4/_3935]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[1,2,512,125056] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
     [[node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot (defined at /site-packages/mesh_tensorflow/ops.py:3846) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[while_loop/while/decoder/block_005/layer_002/dropout/einsum/parallel_1/split/_13695]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (2) Resource exhausted: OOM when allocating tensor with shape[1,2,512,125056] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
     [[node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot (defined at /site-packages/mesh_tensorflow/ops.py:3846) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (3) Resource exhausted: OOM when allocating tensor with shape[1,2,512,125056] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
     [[node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot (defined at /site-packages/mesh_tensorflow/ops.py:3846) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[while_loop/while/decoder/block_006/layer_000/SelfAttention/einsum_5/gradients/einsum/parallel_1_2/split/_14145]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (4) Resource exhausted: OOM when allocating tensor with shape[1,2,512,125056] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
     [[node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot (defined at /site-packages/mesh_tensorflow/ops.py:3846) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[while_loop/while/decoder/block_003/layer_001/dropout/einsum/parallel_0_2/split/_13577]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot:
 while_loop/while/decoder/one_hot_1/parallel_1_2/Cast/x (defined at /site-packages/mesh_tensorflow/ops.py:3844)
 while_loop/while/decoder/one_hot_1/parallel_1_2/sub (defined at /site-packages/mesh_tensorflow/ops.py:3842)
 while_loop/while/decoder/one_hot_1/parallel_1_2/Cast_1/x (defined at /site-packages/mesh_tensorflow/ops.py:3845)

Input Source operations connected to node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot:
 while_loop/while/decoder/one_hot_1/parallel_1_2/Cast/x (defined at /site-packages/mesh_tensorflow/ops.py:3844)
 while_loop/while/decoder/one_hot_1/parallel_1_2/sub (defined at /site-packages/mesh_tensorflow/ops.py:3842)
 while_loop/while/decoder/one_hot_1/parallel_1_2/Cast_1/x (defined at /site-packages/mesh_tensorflow/ops.py:3845)

Input Source operations connected to node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot:
 while_loop/while/decoder/one_hot_1/parallel_1_2/Cast/x (defined at /site-packages/mesh_tensorflow/ops.py:3844)
 while_loop/while/decoder/one_hot_1/parallel_1_2/sub (defined at /site-packages/mesh_tensorflow/ops.py:3842)
 while_loop/while/decoder/one_hot_1/parallel_1_2/Cast_1/x (defined at /site-packages/mesh_tensorflow/ops.py:3845)

Input Source operations connected to node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot:
 while_loop/while/decoder/one_hot_1/parallel_1_2/Cast/x (defined at /site-packages/mesh_tensorflow/ops.py:3844)
 while_loop/while/decoder/one_hot_1/parallel_1_2/sub (defined at /site-packages/mesh_tensorflow/ops.py:3842)
 while_loop/while/decoder/one_hot_1/parallel_1_2/Cast_1/x (defined at /site-packages/mesh_tensorflow/ops.py:3845)

Input Source operations connected to node while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot:
 while_loop/while/decoder/one_hot_1/parallel_1_2/Cast/x (defined at /site-packages/mesh_tensorflow/ops.py:3844)
 while_loop/while/decoder/one_hot_1/parallel_1_2/sub (defined at /site-packages/mesh_tensorflow/ops.py:3842)
 while_loop/while/decoder/one_hot_1/parallel_1_2/Cast_1/x (defined at /site-packages/mesh_tensorflow/ops.py:3845)

Original stack trace for 'while_loop/while/decoder/one_hot_1/parallel_1_2/one_hot':
  File "/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/site-packages/t5/models/mesh_transformer_main.py", line 264, in <module>
    console_entry_point()
  File "/site-packages/t5/models/mesh_transformer_main.py", line 261, in console_entry_point
    app.run(main)
  File "/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/site-packages/t5/models/mesh_transformer_main.py", line 255, in main
    model_dir=FLAGS.model_dir)
  File "/site-packages/gin/config.py", line 1055, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/site-packages/mesh_tensorflow/transformer/utils.py", line 2302, in run
    skip_seen_data=skip_seen_data)
  File "/site-packages/mesh_tensorflow/transformer/utils.py", line 1615, in train_model
    estimator.train(input_fn=input_fn, max_steps=train_steps)
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3084, in train
    saving_listeners=saving_listeners)
  File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2921, in _call_model_fn
    config)
  File "/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3179, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1700, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2031, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/site-packages/mesh_tensorflow/transformer/utils.py", line 686, in my_model_fn
    log_file=model_info_file)
  File "/site-packages/mesh_tensorflow/ops.py", line 726, in __init__
    op.lower(self)
  File "/site-packages/mesh_tensorflow/ops.py", line 6092, in lower
    **self._tf_kwargs)
  File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2774, in while_loop
    return_same_structure)
  File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2256, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2181, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/site-packages/mesh_tensorflow/ops.py", line 6064, in tf_body_fn
    op.lower(lowering)
  File "/site-packages/mesh_tensorflow/ops.py", line 3848, in lower
    slicewise_fn, lowering.tensors[indices], offset)
  File "/site-packages/mesh_tensorflow/placement_mesh_impl.py", line 173, in slicewise
    ret = mtf.parallel(self.devices, fn, *inputs)
  File "/site-packages/mesh_tensorflow/ops.py", line 5666, in parallel
    ret.append(fn(*my_args, **my_kwargs))
  File "/site-packages/mesh_tensorflow/ops.py", line 3846, in slicewise_fn
    dtype=self._dtype)
  File "/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/site-packages/tensorflow/python/ops/array_ops.py", line 4123, in one_hot
    name)
  File "/site-packages/tensorflow/python/ops/gen_array_ops.py", line 6327, in one_hot
    off_value=off_value, axis=axis, name=name)
  File "/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/site-packages/tensorflow/python/framework/ops.py", line 3485, in _create_op_internal
    op_def=op_def)
  File "/site-packages/tensorflow/python/framework/ops.py", line 1949, in __init__
    self._traceback = tf_stack.extract_stack()

  In call to configurable 'run' (<function run at 0x7f69b5e15048>)

I am using --gin_param="utils.run.mesh_shape = 'model:1,batch:2'"
@adarob
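
The numbers in the log above can be cross-checked with a little arithmetic. The sketch below assumes that "type float" in the TensorFlow message means 32-bit floats (4 bytes per element); the shape and the allocator figures are copied from the log, and the helper names are made up for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: "type float" in the log is a 32-bit float


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def to_gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


one_hot_bytes = tensor_bytes([1, 2, 512, 125056])  # shape from the final OOM message
memory_limit = 22002673920                         # "Limit" in the allocator stats
in_use = 21983674112                               # "InUse" in the allocator stats
free_bytes = memory_limit - in_use

print(one_hot_bytes)                    # 512229376
print(round(to_gib(memory_limit), 2))   # 20.49
print(free_bytes)                       # 18999808
print(one_hot_bytes > free_bytes)       # True
```

The one_hot tensor alone is 512229376 bytes (488.5 MiB), which matches the "9 Chunks of size 512229376" line, while only about 18 MiB of the roughly 20.49 GiB limit was still free at that point. So the allocation had no chance of fitting, and several tensors of that size were already resident on the device.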

crystina-z commented 3 years ago

Hi! I seem to encounter the same error as above after trying all the solutions above.

The error:

ValueError: No Task or Mixture found with name: xnli_zeroshot
  In call to configurable 'get_vocabulary' (<function get_vocabulary at 0x7fe995422598>)

My python and t5 version:

Python 3.6.13
t5 0.7.1

My script (run under the cloned multilingual-t5 path, commit id a0a42fc):

python -m t5.models.mesh_transformer_main \
  --module_import="multilingual_t5.tasks" \
  --tpu="${TPU}" \
  --gcp_project="${PROJECT_NAME}" \
  --tpu_zone="${TPU_ZONE}" \
  --model_dir="${GS_FOLDER}" \
  --gin_file="gs://t5-data/pretrained_models/mt5/base/operative_config.gin" \
  --gin_param="init_checkpoint = 'gs://t5-data/pretrained_models/mt5/base/model.ckpt-1000000'" \
  --gin_param="run.train_steps = 1100000" \
  --gin_param="run.save_checkpoints_steps = 10000" \
  --gin_param="utils.run.batch_size=('tokens_per_batch', 65536)" \
  --gin_param="utils.run.train_dataset_fn = @t5.models.mesh_transformer.tsv_dataset_fn" \
  --gin_param="tsv_dataset_fn.filename = '${GS_FOLDER}/train.tsv' " \
  --gin_location_prefix="multilingual_t5/gin/"

Any idea what I missed here?