it's there:

./run_seq2seq.py -h | grep deepspeed
  [--sharded_ddp [SHARDED_DDP]] [--deepspeed DEEPSPEED]
  --deepspeed DEEPSPEED
                        Enable deepspeed and pass the path to deepspeed json

of course, it would OOM w/o --deepspeed in your situation.

and you could just pip install deepspeed==0.3.10 too ;)

And I don't know if the xsum dataset is the same. The one we used with finetune_trainer.py was hand-curated, see: https://github.com/huggingface/transformers/issues/10044. I'm trying to figure out how to make these available through the dataset hub.
> it's there:
>
> ./run_seq2seq.py -h | grep deepspeed
>   [--sharded_ddp [SHARDED_DDP]] [--deepspeed DEEPSPEED]
>   --deepspeed DEEPSPEED
>                         Enable deepspeed and pass the path to deepspeed json
>
> of course, it would OOM w/o --deepspeed in your situation.
Ugh. Sorry, my toddler didn't sleep well last night. Maybe I should just hang up my compiler for the day. Of course I just looked with my eyeballs instead of grep, and it's one of like three lines in the enormous parameter listing with a second parameter on the same line. :)
> and you could just pip install deepspeed==0.3.10 too ;)
I use the ./install.sh script because of that issue with the A100 architecture (80) seemingly not included by default. I haven't followed up to check if that's fixed in the last few weeks.
> And I don't know if the xsum dataset is the same. The one we used with finetune_trainer.py was hand-curated, see: #10044. I'm trying to figure out how to make these available through the dataset hub.
The behavior when running is a bit different -- I put xsum in the examples/seq2seq folder, but it downloaded a fresh copy from the dataset hub and used it, so that should be okay.
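For reference, that hub fetch amounts to roughly the following - a minimal sketch (not taken from the script itself), assuming the datasets library:

```python
# Sketch: load xsum from the dataset hub, which is what run_seq2seq.py does for
# --dataset_name xsum; a local folder of the same name is not consulted.
from datasets import load_dataset

ds = load_dataset("xsum")            # downloads and caches the hub copy
print(ds)                            # train / validation / test splits
print(ds["train"][0]["summary"])     # each example has "document" and "summary" fields
```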
When running with the deepspeed option:
export OUTPUTDIR=tst-summarization
export BS=1; rm -rf $OUTPUTDIR; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./run_seq2seq.py \
--model_name_or_path allenai/unifiedqa-t5-11b \
--do_train \
--do_eval \
--do_predict \
--task summarization \
--dataset_name xsum \
--output_dir $OUTPUTDIR \
--per_device_train_batch_size=$BS \
--per_device_eval_batch_size=$BS \
--overwrite_output_dir \
--predict_with_generate \
--max_train_samples 500 \
--max_val_samples 100 \
--max_test_samples 100 \
--deepspeed ../tests/deepspeed/ds_config.json
It gets a little further, but then still OOMs:
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 2; 39.59 GiB total capacity; 36.92 GiB already allocated; 4.69 MiB free; 37.30 GiB reserved in total by PyTorch)
Traceback (most recent call last):
File "./run_seq2seq.py", line 629, in <module>
main()
File "./run_seq2seq.py", line 561, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/home/pajansen/github/transformers-feb12-2021/transformers/src/transformers/trainer.py", line 960, in train
tr_loss += self.training_step(model, inputs)
File "/home/pajansen/github/transformers-feb12-2021/transformers/src/transformers/trainer.py", line 1346, in training_step
self.deepspeed.backward(loss)
File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 845, in backward
self.optimizer.backward(loss)
File "/home/pajansen/anaconda3/envs/transformers-feb12-2021/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1603, in backward
buf_1 = torch.empty(int(self.reduce_bucket_size * 4.5),
RuntimeError: CUDA out of memory. Tried to allocate 1.68 GiB (GPU 1; 39.59 GiB total capacity; 35.88 GiB already allocated; 840.69 MiB free; 36.48 GiB reserved in total by PyTorch)
0%|▍ | 1/375 [00:09<58:33, 9.39s/it]
The ds_config.json bucket sizes are 2e8. I'm not sure I've run xsum before, so it's not clear to me if that just needs to be tinkered with (I'll try a few more values, and report back if that solves it).
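Incidentally, the 1.68 GiB in the failing allocation lines up with that 2e8 setting; a quick back-of-the-envelope check (assuming the buffer is allocated in fp16):

```python
# The traceback above allocates `torch.empty(int(self.reduce_bucket_size * 4.5), ...)`.
# With reduce_bucket_size = 2e8 elements and 2 bytes/element (fp16) that is ~1.68 GiB,
# so shrinking the bucket sizes in ds_config.json shrinks this buffer proportionally.
elems = int(2e8 * 4.5)          # 900,000,000 elements
gib = elems * 2 / 2**30         # fp16 assumption: 2 bytes per element
print(f"{gib:.2f} GiB")         # -> 1.68 GiB
```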
(FYI, it does look like training works on https://github.com/huggingface/transformers/commit/c130e67dce56a092604949a8df6384a17f762189, confirming your suggestion that the change probably happened in #10114.)
Thank you for validating that, @PeterAJansen. I will research and get back to you hopefully with a better solution.
Just an update on the new script - I finally managed to get it to produce an equivalent bleu score:
- Needed to convert the dataset into jsonlines; see https://github.com/huggingface/transformers/issues/10036.
- Multiple other changes were needed as well; the easiest one to miss (since it won't fail, it just produces abysmal results) is the one at the end of this comment.
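Roughly, the conversion boils down to something like this sketch - one JSON object per line with a "translation" dict (see the linked issue for the exact details):

```python
# Hypothetical converter sketch: write one {"translation": {"en": ..., "ro": ...}}
# record per line, the layout read via --train_file / --validation_file.
import json

pairs = [("Hello world", "Salut lume")]          # toy pair, purely illustrative
with open("train.json", "w", encoding="utf-8") as f:
    for en, ro in pairs:
        f.write(json.dumps({"translation": {"en": en, "ro": ro}}, ensure_ascii=False) + "\n")
```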
and then the script is:
export BS=16; rm -r output_dir; PYTHONPATH=../../src USE_TF=0 python ./run_seq2seq.py \
--model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 \
--train_file /hf/transformers-master/examples/seq2seq/wmt_en_ro/train.json \
--validation_file /hf/transformers-master/examples/seq2seq/wmt_en_ro/val.json \
--do_eval --do_train --evaluation_strategy=steps --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step \
--logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir \
--per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000 \
--sortish_sampler --task translation_en_to_ro --val_max_target_length 128 --warmup_steps 500 \
--max_train_samples 2000 --max_val_samples 500 --source_prefix "translate English to Romanian: "
Note the important new addition: --source_prefix "translate English to Romanian: ". Without it the score is close to 0, as the new script doesn't add the T5 task prefix automatically. I advocate changing that, but time will tell.
I'm not sure if the xsum dataset is the same - didn't get to it yet. So with summarization you most likely need to add --source_prefix "summarize: ".
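To make it concrete, the flag simply prepends the prefix to each source text before tokenization - roughly along these lines (a sketch; the sample sentence is made up):

```python
# Sketch of what --source_prefix amounts to: prepend the task prefix to each input
# before tokenizing, since T5 was pre-trained with explicit task prefixes.
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
prefix = "translate English to Romanian: "        # or "summarize: " for xsum
text = "The weather is nice today."               # made-up example input
batch = tok(prefix + text, max_length=128, truncation=True, return_tensors="pt")
print(tok.convert_ids_to_tokens(batch["input_ids"][0].tolist())[:8])
```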
Further update: I ported the wmt pre-processed data to HF datasets, so now the dataset fetching is automated:
export BS=16; rm -r output_dir; PYTHONPATH=../../src USE_TF=0 python ./run_seq2seq.py \
--model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 \
--do_eval --do_train --evaluation_strategy=steps --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step \
--logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir \
--per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000 \
--sortish_sampler --task translation_en_to_ro --val_max_target_length 128 --warmup_steps 500 \
--max_train_samples 2000 --max_val_samples 500 --source_prefix "translate English to Romanian: " \
--dataset_name wmt16-en-ro-pre-processed
@PeterAJansen, so I have been thinking about that change I introduced, which you discovered made it impossible to eval the 45GB model on a 40GB card. But the thing is: before the change, during eval you were using an fp16 version left over from training, which from what I understand may not give good accuracy - have you run evaluation and gotten good results?
I'm trying to see whether the Trainer should support fp16 in eval.
The tricky issue is that currently we switch .to(device) in the trainer's init, so this will have to be re-worked somehow. But first I would love to hear whether that works on t5-11b quality-wise. model.half() will require only ~22GB.
As a quick test, if you're doing eval only and no training, it could be hacked by putting the half() call before switching to the gpu:
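Something along these lines - an untested sketch of the idea outside the Trainer; inside the Trainer itself the .half() would have to go right before the model.to(self.args.device) line in its init:

```python
# Untested sketch of the eval-only hack: halve the weights *before* any .to(device) call,
# so only ~22GB (11B params * 2 bytes) must fit on the 40GB card instead of ~45GB in fp32.
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("allenai/unifiedqa-t5-11b")  # fp32, on CPU
model = model.half().eval()      # ~45GB -> ~22GB of weights, still on CPU
model = model.to("cuda:0")       # the move to the GPU now fits
# ... then run generate()/eval with this model (eval only - no training in this mode)
```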
Hmmm, that's a good question. I've been doing exploration on new data, and the generations looked okay by eye, but I don't have a solid metric to automatically evaluate them right now -- so I can't immediately answer the question of whether the results look good.
I've had a long run going for about 5 days that should be done in about 10 hours. Is there a test run that one of us could try then to verify that things look good before I stick the next 5-day batch on? :) (perhaps one of the standard t5 evaluation datasets with known performance?).
What task and language are you training/fine-tuning for? That way we can find a way to compare apples to apples, and it might be indicative.
And of course the ultimate test is to compare the scores for the same model before and after the fine-tuning/training on the same test data.
Mine is a big can of worms (a complex inference task, with the data currently being generated by annotators, with no current automated metrics for evaluation) so we should use something different.
Maybe the WMT task, since it's one of the examples shown in the huggingface seq2seq readme (and the one I used for the example script above to show the bug)? There are published expected results in Table 14 (page 39) of the T5 paper that we can use as a guide.
So if you're running many days of training and you have no way of evaluating the quality improvement, what then is the point of this exercise? Just to first know that it can be trained? Which is a totally valid exercise.
Surely you could establish at least some baseline, to know even roughly whether there is an improvement.
If the data/task is similar to WMT then yes, it'd be useful.
e.g. eval en2ro translation:
export BS=16; rm -r output_dir; PYTHONPATH=../../src USE_TF=0 CUDA_VISIBLE_DEVICES=0 python ./run_seq2seq.py \
--model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06 --do_eval --evaluation_strategy=steps \
--label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 \
--max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS \
--predict_with_generate --eval_steps 25000 --sortish_sampler --task translation_en_to_ro \
--val_max_target_length 128 --warmup_steps 500 --max_val_samples 500 --dataset_name wmt16 \
--dataset_config "ro-en" --source_prefix "translate English to Romanian: "
...
02/16/2021 10:45:50 - INFO - __main__ - ***** val metrics *****
02/16/2021 10:45:50 - INFO - __main__ - val_bleu = 24.1257
02/16/2021 10:45:50 - INFO - __main__ - val_gen_len = 39.554
02/16/2021 10:45:50 - INFO - __main__ - val_loss = 3.7917
02/16/2021 10:45:50 - INFO - __main__ - val_runtime = 18.2931
02/16/2021 10:45:50 - INFO - __main__ - val_samples = 500
02/16/2021 10:45:50 - INFO - __main__ - val_samples_per_second = 27.333
note that the eval scores are very language pair-specific - the variations between various pairs can be huge.
The short answer is, I work in an area that doesn't yet have good automated metrics for evaluating generation quality, and so we typically evaluate them manually (which takes a lot of time, typically from research assistants -- part of what we're working on right now is figuring out reasonable automated metrics). But we still know from other earlier work and analyses that we've done that pre-training on related data helps, so that's what I'm doing now (the long early tail of pre-training). While I know that pre-training helps from past work, I can't easily evaluate it online -- I have to run the set, then evaluate it manually.
But all that is unrelated to the original question, whether T5-11B fp16 evaluation (in general, not paired to a specific dataset) has an issue or works okay relative to fp32:
> @PeterAJansen, so I have been thinking about that change I introduced, which you discovered made it impossible to eval the 45GB model on a 40GB card. But the thing is: before the change, during eval you were using an fp16 version left over from training, which from what I understand may not give good accuracy - have you run evaluation and gotten good results?
> I'm trying to see whether the Trainer should support fp16 in eval.
To figure that out, we won't be able to use my lab's dataset for various technical reasons, so if there's some minimal benchmarking dataset that works well with automated evaluation and can measure this, that would be best to use. :)
Thank you for elucidating your particular situation, @PeterAJansen
I'm going to run some experiments on fp16 eval against fp32 for t5 w/ wmt and we shall see. If it works well, then we can make fp16-eval available in the Trainer for those who want to try it.
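Roughly along these lines - a sketch of the comparison, not the actual experiment script (the sample sentence and reference are made up, and the sacrebleu package is assumed to be installed):

```python
# Sketch: generate with the same T5 checkpoint in fp32 and fp16 and compare BLEU on a
# tiny made-up sample; the real comparison would use the wmt16 en-ro validation set.
import torch
import sacrebleu
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-small")
src = ["translate English to Romanian: The weather is nice today."]
refs = [["Vremea este frumoasă astăzi."]]        # hypothetical reference translation

for dtype in (torch.float32, torch.float16):
    model = T5ForConditionalGeneration.from_pretrained("t5-small").to("cuda", dtype=dtype).eval()
    with torch.no_grad():
        out = model.generate(**tok(src, return_tensors="pt", padding=True).to("cuda"), max_length=128)
    hyp = tok.batch_decode(out, skip_special_tokens=True)
    print(dtype, sacrebleu.corpus_bleu(hyp, refs).score)
```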
Interesting and possibly related bug (on c130e67):
1. Fine-tuning T5-11B from the model hub (and saving it as, e.g., Model 2) works.
2. Subsequently further fine-tuning Model 2 (loaded from disk) on different data appears to OOM.
Yes, there are a few places where model.to(self.args.device) is called - does the OOM go away if you disable them all? I think there are 2 more that aren't conditioned on deepspeed.
Most likely I need to go over and replicate each place where it's done for self.is_model_parallel, since it's the same circumstances where we don't want the model to be on device right away.
Also, what was the specific 2nd command line? So that I can add a test.
Thank you.
This:
diff --git a/src/transformers/trainer.py b/src/transformers/trainer.py
index 8afae0720..cda1a2822 100755
--- a/src/transformers/trainer.py
+++ b/src/transformers/trainer.py
@@ -792,7 +792,7 @@ class Trainer:
# If model was re-initialized, put it on the right device and update self.model_wrapped
if model_reloaded:
- if not self.is_model_parallel and self.args.place_model_on_device:
+ if not (self.is_model_parallel or (args.deepspeed and args.do_train)) and self.args.place_model_on_device:
self.model = self.model.to(self.args.device)
self.model_wrapped = self.model
@@ -1045,7 +1045,7 @@ class Trainer:
)
if isinstance(self.model, PreTrainedModel):
self.model = self.model.from_pretrained(self.state.best_model_checkpoint)
- if not self.is_model_parallel and self.args.place_model_on_device:
+ if not (self.is_model_parallel or (args.deepspeed and args.do_train)) and self.args.place_model_on_device:
self.model = self.model.to(self.args.device)
else:
state_dict = torch.load(os.path.join(self.state.best_model_checkpoint, WEIGHTS_NAME))
Thanks! I hope to be able to give this diff a test tonight when the current run is done (about 10h left).
> Also, what was the specific 2nd command line? So that I can add a test.
Here are two cases: my exact script, and a distilled version that matches the WMT example at the top of this issue from the readme.

1. My exact script:
#!/bin/bash
export DATADIR=/home/pajansen/github/compositional-expl/pretrain/min-6-max-8/
export MODELDIR=allenai/unifiedqa-t5-11b
#export MODELDIR=output_dir_compexpl-feb8-epoch3-uqa-11b-pretrain-teacher-min4-max5
export SEQLEN=256
export EPOCHS=3
export OUTPUTDIR=output_dir_compexpl-feb16-epoch${EPOCHS}-uqa-11b-pretrain-teacher-min6-max8
export BS=1; rm -rf $OUTPUTDIR; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./finetune_trainer.py \
    --model_name_or_path $MODELDIR --output_dir $OUTPUTDIR --adam_eps 1e-06 --data_dir $DATADIR \
    --do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
    --logging_first_step --logging_steps 5000 --max_source_length $SEQLEN --max_target_length $SEQLEN --num_train_epochs $EPOCHS \
    --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS \
    --predict_with_generate --sortish_sampler \
    --test_max_target_length $SEQLEN --val_max_target_length $SEQLEN \
    --warmup_steps 5 \
    --deepspeed ../tests/deepspeed/ds_config.json --fp16 \
    --save_total_limit 2 \
    --save_steps 5000
2. But here's a distilled version, using the WMT example, that should illustrate the issue (I haven't run this one). The call is identical; it's just the OUTPUTDIRx and MODELDIRx environment variables that change (though in practice, as above, you'd want to change the data you're fine-tuning with, too):
export OUTPUTDIR1=tst-summarization-step1
export MODELDIR1=allenai/unifiedqa-t5-11b
export BS=1; rm -rf $OUTPUTDIR1; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./run_seq2seq.py \
    --model_name_or_path $MODELDIR1 \
    --do_train \
    --do_eval \
    --do_predict \
    --task summarization \
    --dataset_name xsum \
    --output_dir $OUTPUTDIR1 \
    --per_device_train_batch_size=$BS \
    --per_device_eval_batch_size=$BS \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_train_samples 500 \
    --max_val_samples 100 \
    --max_test_samples 100
export OUTPUTDIR2=tst-summarization-step2
export MODELDIR2=$OUTPUTDIR1
export BS=1; rm -rf $OUTPUTDIR2; PYTHONPATH=../../src USE_TF=0 /usr/bin/time -v deepspeed --num_gpus=4 ./run_seq2seq.py \
    --model_name_or_path $MODELDIR2 \
    --do_train \
    --do_eval \
    --do_predict \
    --task summarization \
    --dataset_name xsum \
    --output_dir $OUTPUTDIR2 \
    --per_device_train_batch_size=$BS \
    --per_device_eval_batch_size=$BS \
    --overwrite_output_dir \
    --predict_with_generate \
    --max_train_samples 500 \
    --max_val_samples 100 \
    --max_test_samples 100
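In other words, at the API level the two steps boil down to something like this sketch (the actual runs of course go through run_seq2seq.py + deepspeed as above):

```python
# Sketch of the two-step scenario: step 1 starts from the hub checkpoint and saves locally;
# step 2 reloads that local copy and fine-tunes again - the step-2 run is the one that OOMs.
from transformers import T5ForConditionalGeneration

# step 1: fine-tune the hub checkpoint, then save
model = T5ForConditionalGeneration.from_pretrained("allenai/unifiedqa-t5-11b")
# ... train via Trainer/deepspeed ...
model.save_pretrained("tst-summarization-step1")

# step 2: reload the locally saved checkpoint and fine-tune on different data
model2 = T5ForConditionalGeneration.from_pretrained("tst-summarization-step1")
# ... train via Trainer/deepspeed ...  <- this run appears to OOM
```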
Thank you for the details, @PeterAJansen - hoping to validate later in the day, but meanwhile this PR should solve it https://github.com/huggingface/transformers/pull/10243 (i.e. instead of the patch I sent last night).
Edit: the PR has been merged, so master should be OK.
Questions:

1. eval: currently it won't fit 45GB onto 22GB - I'm working on a solution.
2. Can you check if the saved model is bigger than the original? My feeling is that something else gets tacked onto the model that wasn't there in the original (a size-check sketch follows below).
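Something like this would do for the check (a sketch; the directory name is hypothetical, and ~11B fp32 parameters come to roughly 44GB of weights):

```python
# Sketch: total on-disk size of the saved checkpoint directory, to compare against the
# ~44GB (11B params * 4 bytes) expected for the original fp32 weights.
import os

saved_dir = "tst-summarization-step1"            # hypothetical: the fine-tuned output_dir
total = sum(
    os.path.getsize(os.path.join(root, f))
    for root, _, files in os.walk(saved_dir)
    for f in files
)
print(f"{total / 1e9:.1f} GB on disk")
```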
I developed a new memory usage metrics feature: https://github.com/huggingface/transformers/pull/10225 so that should make it possible to identify and debug such problems on a much smaller model. You will probably find it useful too.
So I should be well equipped to run your failing scenario now.
FYI, master has a new Trainer flag --fp16_full_eval (https://github.com/huggingface/transformers/pull/10268), so now you should be able to eval at fp16 and fit t5-11b onto a 40GB gpu. It may or may not do what you want quality-wise, since model.half() doesn't always produce the desired results. But it does restore the original ability to fit non-deepspeed eval in fp16.
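For example (a sketch using the Python API; the same thing can be passed on the command line as --fp16_full_eval):

```python
# Sketch: enable the new flag so eval runs on a .half() copy of the model,
# i.e. ~11B params * 2 bytes ≈ 22GB of weights instead of ~45GB in fp32.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="output_dir",          # hypothetical
    do_eval=True,
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    fp16_full_eval=True,              # evaluate with fp16 weights
)
```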
Still need to check on your 2 step scenario OOM report, @PeterAJansen
Another update: DS currently locks one in if one wants to be able to access the fp32 model - see https://github.com/microsoft/DeepSpeed/issues/797. Once they add a method to extract the fp32 model (https://github.com/microsoft/DeepSpeed/issues/800), we can sort this out.
(A continuation of #10149, since it looks like it's a broader issue.)
It looks like seq2seq has changed in the past week, and now gives out-of-memory errors for @stas00 's impressive recent DeepSpeed work that allowed training/predicting e.g. T5-11B on a single 40GB card.
Here's a simple repeatable example using the newer scripts:
Run script:
(One note: should I be adding a --deepspeed option, as with the old finetune_trainer.py? I am not seeing it in the list of options. And if so, should it be pointing to the new location for the config file (../tests/deepspeed/ds_config.json), or does it use this location by default?)
Conda Environment:
Error: