huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[trainer] seq2seq doesn't handle mt5 correctly #9865

Closed mxa4646 closed 3 years ago

mxa4646 commented 3 years ago

Environment info

Who can help

@stas00, @patrickvonplaten, @patil-suraj

Information

Model I am using (MT5-xl, MT5-large):

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

  1. The script I used is examples/seq2seq/finetune_trainer.py, which was originally used to reproduce the training of T5-3b on a single 3090. The procedure is the same as in #8771, and it can reproduce the training of T5-3b (on a single card or on 2/4 cards).
  2. Here is the problem: when I try to train MT5-xl, --freeze_embeds seems to trigger a bug. I used 4*3090. My script is:
    export BS=1; PYTHONPATH=../../src; USE_TF=0;
    /usr/bin/time -v deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path /<my_model_dir>/models/mt5/xl/v0 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16

    Here is my report:

[2021-01-27 14:59:52,982] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-27 14:59:57,024] [INFO] [runner.py:358:main] cmd = /<my_dir>/miniconda3/envs/nlp/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path /<my_model_dir>/models/mt5/xl/v0 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16
[2021-01-27 14:59:57,793] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2021-01-27 14:59:57,793] [INFO] [launch.py:87:main] nnodes=1, num_local_procs=4, node_rank=0
[2021-01-27 14:59:57,793] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2021-01-27 14:59:57,793] [INFO] [launch.py:100:main] dist_world_size=4
[2021-01-27 14:59:57,793] [INFO] [launch.py:103:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2021-01-27 15:00:01,106] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,340] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,672] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
[2021-01-27 15:00:01,870] [INFO] [distributed.py:40:init_distributed] Initializing torch distributed with backend: nccl
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
01/27/2021 15:00:05 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, evaluation_strategy=<EvaluationStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=5, logging_dir='runs/Jan27_15-00-01_user-SYS-4029GP-TRT', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level='O1', fp16_backend='auto', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=25000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed='ds_config.json', label_smoothing_factor=0.1, adafactor=False, sortish_sampler=True, predict_with_generate=True)
01/27/2021 15:00:05 - WARNING - __main__ -   Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: True
[INFO|configuration_utils.py:443] 2021-01-27 15:00:05,352 >> loading configuration file /<my_model_dir>/models/mt5/xl/v0/config.json
[INFO|configuration_utils.py:481] 2021-01-27 15:00:05,353 >> Model config MT5Config {
  "_name_or_path": "/home/patrick/t5/mt5-xl",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 5120,
  "d_kv": 64,
  "d_model": 2048,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 24,
  "num_heads": 32,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.2.1",
  "use_cache": true,
  "vocab_size": 250112
}

[INFO|configuration_utils.py:443] 2021-01-27 15:00:05,353 >> loading configuration file /<my_model_dir>/models/mt5/xl/v0/config.json
[INFO|configuration_utils.py:481] 2021-01-27 15:00:05,354 >> Model config MT5Config {
  "_name_or_path": "/home/patrick/t5/mt5-xl",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 5120,
  "d_kv": 64,
  "d_model": 2048,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 24,
  "num_heads": 32,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.2.1",
  "use_cache": true,
  "vocab_size": 250112
}

[INFO|tokenization_utils_base.py:1685] 2021-01-27 15:00:05,354 >> Model name '/<my_model_dir>/models/mt5/xl/v0' not found in model shortcut name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). Assuming '/<my_model_dir>/models/mt5/xl/v0' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,354 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/special_tokens_map.json. We won't load it.
[INFO|tokenization_utils_base.py:1718] 2021-01-27 15:00:05,355 >> Didn't find file /<my_model_dir>/models/mt5/xl/v0/tokenizer_config.json. We won't load it.
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file /<my_model_dir>/models/mt5/xl/v0/spiece.model
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|tokenization_utils_base.py:1764] 2021-01-27 15:00:05,355 >> loading file None
[INFO|modeling_utils.py:1025] 2021-01-27 15:00:06,472 >> loading weights file /<my_model_dir>/models/mt5/xl/v0/pytorch_model.bin
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
[INFO|modeling_utils.py:1143] 2021-01-27 15:05:03,683 >> All model checkpoint weights were used when initializing MT5ForConditionalGeneration.

[INFO|modeling_utils.py:1152] 2021-01-27 15:05:03,683 >> All the weights of MT5ForConditionalGeneration were initialized from the model checkpoint at /<my_model_dir>/models/mt5/xl/v0.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MT5ForConditionalGeneration for predictions without further training.
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 230, in main
    freeze_embeds(model)
  File "/<my_dir>/transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds
    freeze_params(model.model.shared)
  File "/<my_dir>/miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'
    Command being timed: "deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path /<my_model_dir>/models/mt5/xl/v0 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16"
    User time (seconds): 348.34
    System time (seconds): 177.55
    Percent of CPU this job got: 166%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 5:15.88
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 33558800
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 1
    Minor (reclaiming a frame) page faults: 67111048
    Voluntary context switches: 132337
    Involuntary context switches: 6635761
    Swaps: 0
    File system inputs: 29248712
    File system outputs: 32
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
  3. So I removed --freeze_embeds and tried to train MT5-xl again, but I got a CUDA out-of-memory error. My device is 4*24GB 3090s, with BS=1, ZeRO stage 2, and CPU offload enabled. I assumed that T5-3b and MT5-xl are in the same order of magnitude, and training works for T5-3b, so I did not expect this to happen.
  4. I also tried training MT5-large: just replace mt5-xl with mt5-large under the same conditions as in 3. I got an overflow problem. This does not surprise me, because FP16 does not seem to be fixed for MT5-large yet. In short, I want to know whether there is a problem on my side or whether this is expected. If it is because MT5-large has not been fixed, does Hugging Face have any plans to fix it?

Expected behavior

  1. Why can't mt5-xl be trained on 4*3090? Or what should I do?
  2. Can mt5-large be used with FP16 (mainly with DeepSpeed)? If not, is there any plan to fix it?
stas00 commented 3 years ago

OK, I can reproduce the problem with just google/mt5-small and 2 gpus:

export BS=1; PYTHONPATH=../../src USE_TF=0 deepspeed --num_gpus=2 ./finetune_trainer.py --model_name_or_path google/mt5-small --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16

We will get it sorted out today.

stas00 commented 3 years ago

OK, the problem had nothing to do with DeepSpeed; it's just an oversight in the seq2seq example code.

The fix is:

diff --git a/examples/seq2seq/utils.py b/examples/seq2seq/utils.py
index 8b24bfda..303b89f7 100644
--- a/examples/seq2seq/utils.py
+++ b/examples/seq2seq/utils.py
@@ -563,7 +563,7 @@ def freeze_embeds(model):
     """Freeze token embeddings and positional embeddings for bart, just token embeddings for t5."""
     model_type = model.config.model_type

-    if model_type == "t5":
+    if model_type in ["t5", "mt5"]:
         freeze_params(model.shared)
         for d in [model.encoder, model.decoder]:
             freeze_params(d.embed_tokens)

Please let me know if you can manage to apply this fix. I will make a proper PR later, but it'll take some work, since I need to make a tiny mt5 model and add a test.

You can just edit the file if you don't know how to apply a patch.
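
For context, this is roughly what the patched freeze_embeds in examples/seq2seq/utils.py ends up doing for t5/mt5-style models (a minimal sketch, not the verbatim file; freeze_params simply sets requires_grad to False):

import torch.nn as nn

def freeze_params(module: nn.Module):
    # Turn off gradient updates for every parameter of the module.
    for p in module.parameters():
        p.requires_grad = False

def freeze_embeds(model):
    """Freeze token embeddings (and positional embeddings for bart)."""
    model_type = model.config.model_type

    if model_type in ["t5", "mt5"]:
        # t5/mt5 expose the shared embedding directly on the model,
        # not under a `.model` attribute -- hence the original AttributeError.
        freeze_params(model.shared)
        for d in [model.encoder, model.decoder]:
            freeze_params(d.embed_tokens)
    else:
        # bart-like models nest their submodules under `model.model`.
        freeze_params(model.model.shared)
        for d in [model.model.encoder, model.model.decoder]:
            freeze_params(d.embed_tokens)
            freeze_params(d.embed_positions)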

stas00 commented 3 years ago

The fix should be merged shortly https://github.com/huggingface/transformers/pull/9879

mxa4646 commented 3 years ago

I was able to apply the fix and the --freeze_embeds bug is solved now, thanks for your help! @stas00

As for questions 3 and 4, I noticed that the title of the issue has been edited. I don't know whether those problems are caused by the model or by the seq2seq trainer. Should I raise them in a new issue?

stas00 commented 3 years ago

Oh, you wrote those items as steps to reproduce the problem, so I didn't know that those were issues that needed to/could be fixed.

Once I discovered that the issue you posted was unrelated to DeepSpeed, I took the liberty of adjusting the subject.

In general, yes, let's keep each issue separate; that makes it much easier to track things and not let anything fall through the cracks.

Back to your follow-up question:

Looking just at the params:

So the second model is substantially larger, and if t5-3b only just fit onto a 24GB card, it's not surprising that the larger model didn't.

And in addition to the model params, you also need to allocate memory for:

I tried mt5-xl on a 4x 40GB GPU setup and it worked, but it took ~29GB on each GPU, so there is the problem: you're 5GB short.
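
To get a feel for the gap, a quick back-of-the-envelope check of the parameter count is sketched below (an illustration only: gradients, optimizer states, activations and communication buffers come on top of the raw fp16 weights, and how much of that lands on each GPU depends on the ZeRO config):

from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-xl")
n_params = sum(p.numel() for p in model.parameters())

# fp16 weights alone take 2 bytes per parameter.
print(f"params: {n_params / 1e9:.2f}B")
print(f"fp16 weights: {n_params * 2 / 2**30:.1f} GiB")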

The command I ran was:

export BS=1; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path google/mt5-xl --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size 1 --per_device_train_batch_size 1 --predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config.json --fp16

You may try to tweak the buffer sizes in ds_config.json, but I think the gap is too big.
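
The buffers in question are the ZeRO-2 communication buckets; shrinking them saves some GPU memory at the cost of more, smaller all-gather/reduce operations. A minimal sketch of editing them (the 2e8 values are only an illustration, not a recommendation):

import json

# Load the existing DeepSpeed config and shrink the ZeRO-2 buckets.
with open("ds_config.json") as f:
    ds_config = json.load(f)

ds_config["zero_optimization"]["allgather_bucket_size"] = 2e8
ds_config["zero_optimization"]["reduce_bucket_size"] = 2e8

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)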

I'm working on a 2D Parallelism solution that will combine pipe|model-parallelism w/ ZeRO-DP (DeepSpeed), which should enable such feats with huge models, but it might take some time. The docs aren't quite there so it takes a lot of trial and error to move forward. You may want to track this PR https://github.com/huggingface/transformers/pull/9765 for updates.

Alternatively, when fairscale or DeepSpeed releases ZeRO stage 3, you shouldn't have a problem loading this model onto 4x 24GB GPUs. Currently the problem is that the model params are too big without stage 3. In stage 3 the params are partitioned too, so the problem is solved.

mxa4646 commented 3 years ago

I tried mt5-xl on a 4x 40GB GPU setup and it worked, but it took ~29GB on each GPU, so there is the problem: you're 5GB short.

That helps a lot! Thank you!

I am also looking forward to ZeRO stage 3 and your pipe|model-parallelism work. I hope we can work with them one day. Thank you again!

patil-suraj commented 3 years ago

I got an overflow problem. This does not surprise me, because FP16 does not seem to be fixed for MT5-large yet.

Did you get a NaN loss or a gradient overflow warning? And yes, fp16 is still not working for mT5-large.

I assumed that T5-3b and MT5-xl are in the same order of magnitude

mT5-xl is actually quite a bit bigger than T5-3b, for two reasons:

  1. Its vocab_size is huge (250112), which results in a bigger token embedding layer and final linear layer (see the rough size comparison sketched after this list).
  2. It's based on T5 v1.1, which uses the gated-gelu activation instead of relu; gated-gelu adds one extra linear layer to every feed-forward layer.
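
To put rough numbers on point 1 (the dimensions are an approximation taken from the public configs, and only the embedding/lm_head parameters are counted here, not the full models):

# t5-3b ties its input and output embeddings over a ~32k vocab at d_model=1024;
# mt5-xl has a ~250k vocab at d_model=2048 with untied embeddings.
t5_3b_embed = 32128 * 1024                # shared input/output matrix
mt5_xl_embed = 2 * 250112 * 2048          # separate embedding + lm_head

print(f"t5-3b embedding params:  {t5_3b_embed / 1e6:.0f}M")
print(f"mt5-xl embedding params: {mt5_xl_embed / 1e6:.0f}M")
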
mxa4646 commented 3 years ago

@patil-suraj That's very helpful! Thanks a lot!

Now I understand that there are many differences between mT5-xl and T5-3b, and I will set up separate experiments for them in the future. By the way, do you have any plans to fix FP16 for mt5-large/xl?

dorost1234 commented 3 years ago

Dear @patil-suraj, since you did not mention mt5-small here, have you made it work with fp16? If so, do you mind telling me how you made it work? I am having a hard time with mt5-small and fp16. Thanks a lot for your advice.

loretoparisi commented 2 years ago

I have a similar error here

from transformers import T5TokenizerFast, MT5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained('google/mt5-base') # "google/mt5-base" "google/mt5-large" "google/mt5-xl"

model = MT5ForConditionalGeneration.from_pretrained('google/mt5-base', return_dict=True)

condition = "translate English to German: "
input = "My name is Azeem and I live in India"

# You can also use "translate English to French" and "translate English to Romanian"
input_ids = tokenizer(condition+input, return_tensors="pt").input_ids  # Batch size 1

outputs = model.generate(input_ids)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(decoded)

Stacktrace:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[<ipython-input-8-f9822d331a70>](https://localhost:8080/#) in <module>()
      3 tokenizer = T5TokenizerFast.from_pretrained('google/mt5-base') # "google/mt5-base" "google/mt5-large" "google/mt5-xl"
      4 
----> 5 model = AutoModelForSeq2SeqLM.from_pretrained('google/mt5-base', return_dict=True)
      6 
      7 condition = "translate English to German: "

8 frames
[/usr/local/lib/python3.7/dist-packages/transformers/configuration_utils.py](https://localhost:8080/#) in __getattribute__(self, key)
    250         if key != "attribute_map" and key in super().__getattribute__("attribute_map"):
    251             key = super().__getattribute__("attribute_map")[key]
--> 252         return super().__getattribute__(key)
    253 
    254     def __init__(self, **kwargs):

AttributeError: 'MT5Config' object has no attribute 'relative_attention_max_distance'

@stas00 any idea? I'm using HF master:

!pip install git+https://github.com/huggingface/transformers.git
patil-suraj commented 2 years ago

@loretoparisi

This is because T5Config now has a relative_attention_max_distance attribute, introduced in #16155, which was missing from MT5Config. The fix is here: #16170
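
Until a release containing that fix is installed, one possible stop-gap (untested, and assuming the default value of 128 that T5Config uses for this setting) is to add the missing attribute to the config before building the model:

from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained("google/mt5-base")
# Hypothetical workaround: supply the attribute MT5Config is missing,
# using T5Config's default value.
if not hasattr(config, "relative_attention_max_distance"):
    config.relative_attention_max_distance = 128

model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base", config=config)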