huggingface / transformers


Seq2seq now has larger memory requirements, OOM w/Deepspeed on previously runnable models #10161

Closed: PeterAJansen closed this issue 3 years ago

PeterAJansen commented 3 years ago

(A continuation of #10149, since it looks like it's a broader issue.)

It looks like seq2seq has changed in the past week and now gives out-of-memory errors with @stas00's impressive recent DeepSpeed work, which allowed training/predicting e.g. T5-11B on a single 40GB card.

Here's a simple repeatable example using the newer scripts:

Run script:

export OUTPUTDIR=tst-summarization
export BS=1; rm -rf $OUTPUTDIR; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./run_seq2seq.py \
    --model_name_or_path allenai/unifiedqa-t5-11b \
    --do_train \
    --do_eval \
    --do_predict \
    --task summarization \
    --dataset_name xsum \
    --output_dir $OUTPUTDIR \
    --per_device_train_batch_size=$BS \
    --per_device_eval_batch_size=$BS \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_train_samples 500 \
    --max_val_samples 100 \
    --max_test_samples 100 \

(One note: should I be adding a --deepspeed option, as with the old finetune_trainer.py? I am not seeing it in the list of options. And if so, should it point to the new location of the config file (../tests/deepspeed/ds_config.json), or does it use that location by default?)

Conda Environment:

# Make new environment
conda create --name transformers-feb12-2021 python=3.8
conda activate transformers-feb12-2021

# Clone transformers
git clone https://github.com/huggingface/transformers.git
cd transformers

# Install nightly build of Pytorch
pip install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html -U

# Install seq2seq transformers requirements
pip install -r examples/seq2seq/requirements.txt

# Install transformers
pip install -e .

# Install DeepSpeed from source for the A100 support
cd ..
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed/
# Checkout release for DeepSpeed 0.3.10 (to avoid AMD bug in latest)
git checkout c14b839d9
./install.sh
pip install .

Error:

...
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 2; 39.59 GiB total capacity; 37.87 GiB already allocated; 40.69 MiB free; 37.88 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "./run_seq2seq.py", line 629, in <module>
    main()
  File "./run_seq2seq.py", line 543, in main
    trainer = Seq2SeqTrainer(
  File "/home/pajansen/github/transformers-feb12-2021/transformers/src/transformers/trainer.py", line 276, in __init__
    model = model.to(args.device)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 3; 39.59 GiB total capacity; 37.87 GiB already allocated; 40.69 MiB free; 37.88 GiB reserved in total by PyTorch)

stas00 commented 3 years ago

it's there:

./run_seq2seq.py -h | grep deepspeed
                      [--sharded_ddp [SHARDED_DDP]] [--deepspeed DEEPSPEED]
  --deepspeed DEEPSPEED
                        Enable deepspeed and pass the path to deepspeed json

of course, it would OOM w/o --deepspeed in your situation.

and you could just

pip install deepspeed==0.3.10

too ;)

And I don't know if the xsum dataset is the same. The one we used with finetune_trainer.py was hand-curated, see https://github.com/huggingface/transformers/issues/10044. I'm trying to figure out how to make these available through the datasets hub.

PeterAJansen commented 3 years ago

> it's there:
>
> ./run_seq2seq.py -h | grep deepspeed
>                       [--sharded_ddp [SHARDED_DDP]] [--deepspeed DEEPSPEED]
>   --deepspeed DEEPSPEED
>                         Enable deepspeed and pass the path to deepspeed json
>
> of course, it would OOM w/o --deepspeed in your situation.

Ugh. Sorry, my toddler didn't sleep well last night. Maybe I should just hang up my compiler for the day. Of course I just looked with my eyeballs instead of grep, and it's one of like three lines in the enormous parameter listing with a second parameter on the same line. :)

> and you could just
>
> pip install deepspeed==0.3.10
>
> too ;)

I use the ./install.sh script because of the issue with the A100 architecture (compute capability 8.0) seemingly not being included by default. I haven't followed up to check whether that's been fixed in the last few weeks.
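
For what it's worth, here's a quick sanity check (a hedged sketch, not part of the original report) that the installed PyTorch build actually covers the A100's compute capability 8.0; torch.cuda.get_arch_list() is available in recent PyTorch builds:

# Check that this PyTorch build includes the A100 architecture (sm_80)
import torch

print(torch.cuda.get_device_name(0))        # e.g. "A100-SXM4-40GB"
print(torch.cuda.get_device_capability(0))  # expect (8, 0) on an A100
print(torch.cuda.get_arch_list())           # the list should include 'sm_80'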

> And I don't know if the xsum dataset is the same. The one we used with finetune_trainer.py was hand-curated, see #10044. I'm trying to figure out how to make these available through the datasets hub.

The behavior when running is a bit different -- I put xsum in the examples/seq2seq folder, but it downloaded a fresh copy from the dataset hub and used it, so that should be okay.
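
As a hedged side note, a minimal way to double-check which xsum copy the script is using is to load it directly with the datasets library, which is effectively what run_seq2seq.py does when only --dataset_name is given:

# Pull xsum from the datasets hub and peek at it
from datasets import load_dataset

xsum = load_dataset("xsum")
print(xsum)                                 # DatasetDict with train/validation/test splits
print(xsum["train"][0]["document"][:200])   # first 200 chars of the first article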

When running with the deepspeed option:

export OUTPUTDIR=tst-summarization
export BS=1; rm -rf $OUTPUTDIR; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./run_seq2seq.py \
    --model_name_or_path allenai/unifiedqa-t5-11b \
    --do_train \
    --do_eval \
    --do_predict \
    --task summarization \
    --dataset_name xsum \
    --output_dir $OUTPUTDIR \
    --per_device_train_batch_size=$BS \
    --per_device_eval_batch_size=$BS \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_train_samples 500 \
    --max_val_samples 100 \
    --max_test_samples 100 \
    --deepspeed ../tests/deepspeed/ds_config.json \

It gets a little further, but then still OOMs:

RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 2; 39.59 GiB total capacity; 36.92 GiB already allocated; 4.69 MiB free; 37.30 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "./run_seq2seq.py", line 629, in <module>
    main()
  File "./run_seq2seq.py", line 561, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/pajansen/github/transformers-feb12-2021/transformers/src/transformers/trainer.py", line 960, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/pajansen/github/transformers-feb12-2021/transformers/src/transformers/trainer.py", line 1346, in training_step
    self.deepspeed.backward(loss)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 845, in backward
    self.optimizer.backward(loss)
  File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1603, in backward
    buf_1 = torch.empty(int(self.reduce_bucket_size * 4.5),
RuntimeError: CUDA out of memory. Tried to allocate 1.68 GiB (GPU 1; 39.59 GiB total capacity; 35.88 GiB already allocated; 840.69 MiB free; 36.48 GiB reserved in total by PyTorch)
  0%|▍                                                                                                                                                                  | 1/375 [00:09<58:33,  9.39s/it]

The ds_config.json bucket sizes are 2e8. I'm not sure I've run xsum before, so it's not clear to me if that just needs to be tinkered with (I'll try a few more values, and report back if that solves it).
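
In case it is the bucket sizes, here is a hedged illustration of the knob in question (the traceback above allocates a buffer of reduce_bucket_size * 4.5 elements); the smaller values below are made up for illustration, not a recommendation:

# Write a copy of the DeepSpeed config with smaller ZeRO stage-2 buckets;
# smaller buckets lower peak GPU memory at some cost in communication efficiency
import json

with open("../tests/deepspeed/ds_config.json") as f:
    ds_config = json.load(f)

ds_config["zero_optimization"]["reduce_bucket_size"] = int(5e7)
ds_config["zero_optimization"]["allgather_bucket_size"] = int(5e7)

with open("ds_config_smaller_buckets.json", "w") as f:
    json.dump(ds_config, f, indent=4)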

PeterAJansen commented 3 years ago

(FYI, it does look like training works on:

https://github.com/huggingface/transformers/commit/c130e67dce56a092604949a8df6384a17f762189

confirming your suggestion that the change probably happened in #10114.)

stas00 commented 3 years ago

Thank you for validating that, @PeterAJansen. I will research it and hopefully get back to you with a better solution.

stas00 commented 3 years ago

Just an update on the new script: I finally managed to get it to produce an equivalent BLEU score.

I needed to convert the dataset into JSON Lines (see https://github.com/huggingface/transformers/issues/10036) and make multiple other changes; the easiest one to miss (since it won't fail, but will produce abysmal results) is the one at the end of this comment.

and then the script is:

export BS=16; rm -r output_dir; PYTHONPATH=../../src USE_TF=0 python ./run_seq2seq.py \
--model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 \
--train_file /hf/transformers-master/examples/seq2seq/wmt_en_ro/train.json  \
--validation_file /hf/transformers-master/examples/seq2seq/wmt_en_ro/val.json \
--do_eval --do_train --evaluation_strategy=steps --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step \
--logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir \
--per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000 \
 --sortish_sampler --task translation_en_to_ro  --val_max_target_length 128 --warmup_steps 500 \
--max_train_samples 2000 --max_val_samples 500 --source_prefix "translate English to Romanian: "
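
For anyone following along, the JSON Lines layout expected by --train_file/--validation_file for a translation task looks roughly like this (a hedged sketch with made-up sentences; see #10036 for the actual conversion of the wmt data):

# Write a toy train.json in the one-JSON-object-per-line "translation" format
import json

examples = [
    {"translation": {"en": "The house is wonderful.", "ro": "Casa este minunată."}},
    {"translation": {"en": "I went to the market.", "ro": "Am fost la piață."}},
]
with open("train.json", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")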

Note the important new addition --source_prefix "translate English to Romanian: " - without it the score is close to 0, since the new script doesn't add the prefix for t5 automatically. I advocate changing that, but time will tell.

I'm not sure if the xsum dataset is the same - I didn't get to it yet.

So with summarization you most likely need to add --source_prefix "summarize: "
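
To see why the prefix matters so much, here's a tiny hedged illustration with t5-small (standing in for the bigger models; the sentence is made up) - T5 was pre-trained with a task prefix in the input text, so without one it gets no signal about which task to perform:

# Compare generations with and without the task prefix
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "The house is wonderful."
for prefix in ("", "translate English to Romanian: "):
    ids = tok(prefix + text, return_tensors="pt").input_ids
    out = model.generate(ids, max_length=32)
    print(repr(prefix), "->", tok.decode(out[0], skip_special_tokens=True))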

stas00 commented 3 years ago

Further update: I ported the wmt pre-processed data to HF datasets, so now the dataset fetching is automated:

export BS=16; rm -r output_dir; PYTHONPATH=../../src USE_TF=0 python ./run_seq2seq.py \
--model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 \
--do_eval --do_train --evaluation_strategy=steps --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step \
--logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir \
--per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000 \
 --sortish_sampler --task translation_en_to_ro  --val_max_target_length 128 --warmup_steps 500 \
--max_train_samples 2000 --max_val_samples 500 --source_prefix "translate English to Romanian: " \
--dataset_name wmt16-en-ro-pre-processed

stas00 commented 3 years ago

@PeterAJansen, I have been thinking about the change I introduced that, as you discovered, made it impossible to eval the 45GB model on a 40GB card. The thing is, before that change, during eval you were using the fp16 version left over from training, which from what I understand may not give good accuracy. Have you run evaluation and gotten good results?

I'm trying to see whether the Trainer should support fp16 in eval.

The tricky issue is that currently we call .to(device) in the Trainer's init, so this will have to be reworked somehow. But first I would love to hear whether that works on t5-11b quality-wise. model.half() will require only 22GB.

As a quick test, if you're doing eval only and no training, it could be hacked by calling model.half() before the model is moved to the GPU here:

https://github.com/huggingface/transformers/blob/1c8c2d9ab34b8c8d326db9e0608f8e54cfccb885/src/transformers/trainer.py#L271-L276
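
In other words, the quick hack amounts to something like this (a minimal sketch, assuming eval/predict only with no training; t5-small below is just a stand-in for t5-11b and the input text is made up):

# Halve the model to fp16 *before* it is moved to the GPU, so the .to(device)
# only needs the fp16 footprint (~22GB for t5-11b instead of ~45GB)
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").half().to("cuda:0")

inputs = tok("summarize: The quick brown fox jumps over the lazy dog.", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = model.generate(**inputs, max_length=32)
print(tok.decode(out[0], skip_special_tokens=True))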

PeterAJansen commented 3 years ago

Hmmm, that's a good question. I've been doing exploration on new data, and the generations looked okay by eye, but I don't have a solid metric to automatically evaluate them right now -- so I can't immediately answer the question of whether the results look good.

I've had a long run going for about 5 days that should be done in about 10 hours. Is there a test run that one of us could try then to verify that things look good before I stick the next 5-day batch on? :) (perhaps one of the standard t5 evaluation datasets with known performance?).

stas00 commented 3 years ago

What task and language are you training/fine-tuning for? That way we can find a comparison that is apples to apples and might be indicative.

And of course the ultimate test is to compare the scores for the same model before and after the finetuning/training on the same test data.

PeterAJansen commented 3 years ago

Mine is a big can of worms (a complex inference task, with the data currently being generated by annotators, with no current automated metrics for evaluation) so we should use something different.

Maybe the WMT task, since it's one of the examples shown in the huggingface seq2seq readme (and the one I used for the example script above to show the bug)? There are published expected results in Table 14 (page 39) of the T5 paper that we can use as a guide:

https://arxiv.org/pdf/1910.10683.pdf

stas00 commented 3 years ago

So if you're running many days of training and have no way of evaluating the quality improvement, what is the point of this exercise? Just to first establish that it can be trained? Which is a totally valid exercise.

Surely you could establish at least some baseline, to know even roughly if there is an improvement.

If the data/task is similar to WMT then yes, it'd be useful.

e.g. eval en2ro translation:

export BS=16; rm -r output_dir; PYTHONPATH=../../src USE_TF=0 CUDA_VISIBLE_DEVICES=0 python ./run_seq2seq.py \
--model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --do_eval --evaluation_strategy=steps \
--label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 \
--max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS \
--predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro \
--val_max_target_length 128 --warmup_steps 500 --max_val_samples 500 \
--dataset_name wmt16 --dataset_config "ro-en" --source_prefix "translate English to Romanian: "
...
02/16/2021 10:45:50 - INFO - __main__ -   ***** val metrics *****
02/16/2021 10:45:50 - INFO - __main__ -     val_bleu = 24.1257
02/16/2021 10:45:50 - INFO - __main__ -     val_gen_len = 39.554
02/16/2021 10:45:50 - INFO - __main__ -     val_loss = 3.7917
02/16/2021 10:45:50 - INFO - __main__ -     val_runtime = 18.2931
02/16/2021 10:45:50 - INFO - __main__ -     val_samples = 500
02/16/2021 10:45:50 - INFO - __main__ -     val_samples_per_second = 27.333

note that the eval scores are very language pair-specific - the variations between various pairs can be huge.
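
For reference, val_bleu above is a corpus-level BLEU computed via sacrebleu; in miniature it is roughly the following (a hedged sketch, the strings are placeholders):

# Corpus BLEU: one stream of predictions vs. one parallel stream of references
from sacrebleu import corpus_bleu

preds = ["Casa este minunată.", "Am fost la piață."]
refs = [["Casa este minunată.", "Am fost la piață ieri."]]
print(round(corpus_bleu(preds, refs).score, 4))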

PeterAJansen commented 3 years ago

The short answer is, I work in an area that doesn't yet have good automated metrics for evaluating generation quality, and so we typically evaluate them manually (which takes a lot of time, typically from research assistants -- part of what we're working on right now is figuring out reasonable automated metrics). But we still know from other earlier work and analyses that we've done that pre-training on related data helps, so that's what I'm doing now (the long early tail of pre-training). While I know that pre-training helps from past work, I can't easily evaluate it online -- I have to run the set, then evaluate it manually.

But all that is unrelated to the original question, whether T5-11B fp16 evaluation (in general, not paired to a specific dataset) has an issue or works okay relative to fp32:

> @PeterAJansen, I have been thinking about the change I introduced that, as you discovered, made it impossible to eval the 45GB model on a 40GB card. The thing is, before that change, during eval you were using the fp16 version left over from training, which from what I understand may not give good accuracy. Have you run evaluation and gotten good results?
>
> I'm trying to see whether the Trainer should support fp16 in eval.

To figure that out, we won't be able to use my lab's dataset for various technical reasons, so if there's some minimum benchmarking dataset that helps measure this that works well with automated evaluation, then that would be best to use. :)

stas00 commented 3 years ago

Thank you for elucidating your particular situation, @PeterAJansen

I'm going to run some experiments on fp16 eval against fp32 for t5 w/ wmt and we shall see. If it works well, then we can make fp16-eval available in the Trainer for those who want to try it.

PeterAJansen commented 3 years ago

Interesting and possibly related bug (on c130e67):

1. Fine-tuning T5-11B from the model hub (and saving it as, e.g., Model 2) works.
2. Subsequently further fine-tuning Model 2 (loaded from disk) on different data appears to OOM.

stas00 commented 3 years ago

Yes, there are a few places where model.to(self.args.device) is called. Does the OOM go away if you disable them all? I think there are 2 more that aren't conditioned on deepspeed.

Most likely I need to go over and replicate, for deepspeed, each place where it's done for self.is_model_parallel, since it's the same circumstance where we don't want the model to be moved to the device right away.

Also, what was the specific 2nd command line, so that I can add a test?

Thank you.

stas00 commented 3 years ago

This:

diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index 8afae0720..cda1a2822 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -792,7 +792,7 @@ class Trainer:

         # If model was re-initialized, put it on the right device and update self.model_wrapped
         if model_reloaded:
-            if not self.is_model_parallel and self.args.place_model_on_device:
+            if not (self.is_model_parallel or (args.deepspeed and args.do_train)) and self.args.place_model_on_device:
                 self.model = self.model.to(self.args.device)
             self.model_wrapped = self.model

@@ -1045,7 +1045,7 @@ class Trainer:
             )
             if isinstance(self.model, PreTrainedModel):
                 self.model = self.model.from_pretrained(self.state.best_model_checkpoint)
-                if not self.is_model_parallel and self.args.place_model_on_device:
+                if not (self.is_model_parallel or (args.deepspeed and args.do_train)) and self.args.place_model_on_device:
                     self.model = self.model.to(self.args.device)
             else:
                state_dict = torch.load(os.path.join(self.state.best_model_checkpoint, WEIGHTS_NAME))

PeterAJansen commented 3 years ago

Thanks! I hope to be able to give this diff a test tonight when the current run is done (about 10h left).

> Also, what was the specific 2nd command line, so that I can add a test?

Here are two cases: my exact script, and a distilled version that matches the summarization example at the top of this issue (from the readme):

  1. Here is the exact script I'm using for my experiment (the two MODELDIR exports at the top are the critical difference between it working or not working -- the one currently selected is just the output of a past run of this script pointing at different training data):

    #!/bin/bash
    export DATADIR=/home/pajansen/github/compositional-expl/pretrain/min-6-max-8/
    export MODELDIR=allenai/unifiedqa-t5-11b
    #export MODELDIR=output_dir_compexpl-feb8-epoch3-uqa-11b-pretrain-teacher-min4-max5
    export SEQLEN=256
    export EPOCHS=3
    export OUTPUTDIR=output_dir_compexpl-feb16-epoch$EPOCHS-uqa-11b-pretrain-teacher-min6-max8

export BS=1; rm -rf $OUTPUTDIR; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./finetune_trainer.py --model_name_or_path $MODELDIR --output_dir $OUTPUTDIR --adam_eps 1e-06 --data_dir $DATADIR \
    --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
    --logging_first_step --logging_steps 5000 --max_source_length $SEQLEN --max_target_length $SEQLEN --num_train_epochs $EPOCHS \
    --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS \
    --predict_with_generate --sortish_sampler \
    --test_max_target_length $SEQLEN --val_max_target_length $SEQLEN \
    --warmup_steps 5 \
    --deepspeed ../tests/deepspeed/ds_config.json --fp16 \
    --save_total_limit 2 \
    --save_steps 5000


  2. But here's a distilled version, using the summarization example from the top of this issue, that should illustrate the problem (I haven't run this one). The call is identical in both steps; only the OUTPUTDIRx and MODELDIRx environment variables change (though in practice, like above, you'd also want to change the data you're fine-tuning on):

Step 1: Fine-tune base model with dataset 1

export OUTPUTDIR1=tst-summarization-step1
export MODELDIR1=allenai/unifiedqa-t5-11b
export BS=1; rm -rf $OUTPUTDIR1; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./run_seq2seq.py \
    --model_name_or_path $MODELDIR1 \
    --do_train \
    --do_eval \
    --do_predict \
    --task summarization \
    --dataset_name xsum \
    --output_dir $OUTPUTDIR1 \
    --per_device_train_batch_size=$BS \
    --per_device_eval_batch_size=$BS \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_train_samples 500 \
    --max_val_samples 100 \
    --max_test_samples 100

Step 2: Further fine-tune model saved in Step 1 with new data

Also, pretend that the dataset_name is different here (i.e. the model from Step 1 is being fine-tuned on a different dataset -- but just for the test, fine-tuning twice on the same dataset should still illustrate the OOM issue):

export OUTPUTDIR2=tst-summarization-step2
export MODELDIR2=$OUTPUTDIR1
export BS=1; rm -rf $OUTPUTDIR2; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./run_seq2seq.py \
    --model_name_or_path $MODELDIR2 \
    --do_train \
    --do_eval \
    --do_predict \
    --task summarization \
    --dataset_name xsum \
    --output_dir $OUTPUTDIR2 \
    --per_device_train_batch_size=$BS \
    --per_device_eval_batch_size=$BS \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_train_samples 500 \
    --max_val_samples 100 \
    --max_test_samples 100

stas00 commented 3 years ago

Thank you for the details, @PeterAJansen - I'm hoping to validate later in the day, but meanwhile this PR should solve it: https://github.com/huggingface/transformers/pull/10243 (i.e. use it instead of the patch I sent last night).

Edit: the PR has been merged, so master should be OK.

stas00 commented 3 years ago

Questions:

  1. This is with a non-master version, i.e. the one before that fateful PR of mine, correct? Since eval currently won't fit 45GB onto 22GB - I'm working on a solution.
  2. Can you check whether the saved model is bigger than the original? My feeling is that something else gets tacked onto the model that wasn't there originally.

    I developed a new memory usage metrics feature: https://github.com/huggingface/transformers/pull/10225 so that should make it possible to identify and debug such problems on a much smaller model. You will probably find it useful too.

    So I should be well equipped to run your failing scenario now.

stas00 commented 3 years ago

FYI, master has a new Trainer flag, --fp16_full_eval (https://github.com/huggingface/transformers/pull/10268), so now you should be able to eval in fp16 and fit t5-11b onto a 40GB GPU. It may or may not do what you want quality-wise, since model.half() doesn't always produce the desired results. But it does restore the original ability of eval to fit in fp16, both for the deepspeed and the non-deepspeed Trainer.
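
A hedged sketch of the same flag on the Python side (the argument values other than fp16_full_eval are just illustrative):

# Seq2SeqTrainingArguments with full-fp16 evaluation enabled
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="tst-summarization",
    do_eval=True,
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    fp16_full_eval=True,   # run eval/predict with the model cast to fp16
)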

Still need to check on your 2 step scenario OOM report, @PeterAJansen

stas00 commented 3 years ago

Another update: DeepSpeed currently locks one in if one wants to be able to access the fp32 model, see https://github.com/microsoft/DeepSpeed/issues/797. Once they add a method to extract the fp32 model (https://github.com/microsoft/DeepSpeed/issues/800), we can sort this out.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.