Closed: yuvalkirstain closed this issue 3 years ago.
Hi, I also observe a similar issue with mT5 models (https://github.com/huggingface/transformers/issues/10819); deepspeed is still not working for me because of this issue with mT5 models. I would greatly appreciate you having a look @patil-suraj @patrickvonplaten
We didn't really manage to resolve the problems with t5/mt5 + mixed precision fp16 (cc @patil-suraj). I'm not sure whether anybody has tried internally to fine-tune t5/mt5 with deepspeed (@stas00 maybe?)
The issue arises without deepspeed, with just the vanilla mt5-small model. I also see similar nans under deepspeed with a slightly modified model based on mt5-small, please see https://github.com/huggingface/transformers/issues/10821#issuecomment-803453998. I think that if the issue with the fp16 option gets resolved, this will hopefully also be more stable with model changes under deepspeed. Thanks a lot.
Indeed, this has nothing to do with deepspeed, other than that deepspeed trains in mixed precision and evals in full fp16 at the moment.
I've started studying the bfloat16 vs. float16 numerical properties and their correlation to each other. Once I understand them well I will try to see whether there is some sort of magical remapping that could perhaps be done - this is my fantasy, of course. I just need to finish a few other more urgent things with the deepspeed stage3 integration first.
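For reference, the key numerical difference is that float16 overflows past ~65504 while bfloat16 keeps the float32 exponent range with far fewer mantissa bits. A quick way to compare the two (a minimal illustrative snippet using torch.finfo, nothing more):

import torch

# fp16 overflows past ~65504, which is why large activations become inf;
# bf16 keeps the fp32 exponent range but has far fewer mantissa bits.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max:.3e}  smallest_normal={info.tiny:.3e}  eps={info.eps:.3e}")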
But please don't let my comment prevent you from merging the proposed fix if it already solves the problem.
I got a similar issue with an mT5 model. @patrickvonplaten, thanks a lot in advance for your help.
@dorost1234 + @yuvalkirstain, please kindly try this branch:
https://github.com/huggingface/transformers/tree/t5-fp16-no-nans
and let me know if it solves the problem. It seems that the problem is due to autocast in T5LayerFF, so this branch turns off autocast just for that layer. It also disables the previously added clamping.
There are also a lot of debug statements in the branch, but they will be silent unless nan/inf is detected.
I tested that it works on a small sample with t5-small/t5-base/t5-large/google/mt5-small.
The main part of the fix is just:
class T5LayerFF(nn.Module):
    def forward(self, hidden_states):
        with torch.cuda.amp.autocast(enabled=False):
            forwarded_states = self.layer_norm(hidden_states)
            forwarded_states = self.DenseReluDense(forwarded_states)
            hidden_states = hidden_states + self.dropout(forwarded_states)
        return hidden_states
and removing some code. So please use the branch first.
If it works, I guess we could just monkey patch this version in for AMP or come up with some cleaner solution, probably with a torch.is_autocast_enabled() check.
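For illustration, such a check could look roughly like this (a rough sketch only; the helper name and the module-level patching are just one possibility, not the actual fix):

import torch
from contextlib import nullcontext
from transformers.models.t5.modeling_t5 import T5LayerFF

def _ff_forward_no_autocast(self, hidden_states):
    # Disable autocast for this layer only when AMP is actually active;
    # otherwise run the layer unchanged.
    ctx = torch.cuda.amp.autocast(enabled=False) if torch.is_autocast_enabled() else nullcontext()
    with ctx:
        forwarded_states = self.layer_norm(hidden_states)
        forwarded_states = self.DenseReluDense(forwarded_states)
        hidden_states = hidden_states + self.dropout(forwarded_states)
    return hidden_states

# monkey patch it in only for fp16/AMP training runs
T5LayerFF.forward = _ff_forward_no_autocast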
Dear @stas00, thank you very much for taking the time to look into this issue; it would be really awesome if this could fix it. To test it I checked out the branch, installed it locally with "python setup.py develop", and then ran this command:
python run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /temp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10 --fp16
I got this error:
Traceback (most recent call last):
  File "run_translation.py", line 562, in <module>
    main()
  File "run_translation.py", line 448, in main
    pad_to_multiple_of=8 if training_args.fp16 else None,
TypeError: __init__() got an unexpected keyword argument 'model'
I think there is some version mismatch. I removed the model argument from the input to the collator, as below:
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    # model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8 if training_args.fp16 else None,
)
and then here is what I got with the fp16 option:
{'loss': 23.3523, 'learning_rate': 4.999890767684712e-05, 'epoch': 0.0}
{'loss': 22.5557, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}
{'loss': 25.9471, 'learning_rate': 4.999672303054136e-05, 'epoch': 0.0}
{'loss': 23.0994, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}
{'loss': 24.9974, 'learning_rate': 4.999453838423559e-05, 'epoch': 0.0}
{'loss': 23.3743, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}
{'loss': 24.2147, 'learning_rate': 4.999235373792983e-05, 'epoch': 0.0}
{'loss': 26.7845, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}
{'loss': 25.2277, 'learning_rate': 4.9990169091624065e-05, 'epoch': 0.0}
{'loss': 23.3156, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}
{'loss': 21.275, 'learning_rate': 4.99879844453183e-05, 'epoch': 0.0}
{'loss': 23.7031, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}
{'loss': 23.8086, 'learning_rate': 4.9985799799012544e-05, 'epoch': 0.0}
{'loss': 25.8143, 'learning_rate': 4.998470747585966e-05, 'epoch': 0.0}
{'loss': 24.4319, 'learning_rate': 4.998361515270678e-05, 'epoch': 0.0}
{'loss': 26.8277, 'learning_rate': 4.99825228295539e-05, 'epoch': 0.0}
here is the loss without fp16:
{'loss': 27.0258, 'learning_rate': 4.999890767684712e-05, 'epoch': 0.0}
{'loss': 23.141, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}
{'loss': 21.2312, 'learning_rate': 4.999672303054136e-05, 'epoch': 0.0}
{'loss': 19.3567, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}
{'loss': 18.7998, 'learning_rate': 4.999453838423559e-05, 'epoch': 0.0}
{'loss': 17.9632, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}
{'loss': 17.2105, 'learning_rate': 4.999235373792983e-05, 'epoch': 0.0}
{'loss': 17.5506, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}
{'loss': 15.2566, 'learning_rate': 4.9990169091624065e-05, 'epoch': 0.0}
{'loss': 14.8667, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}
{'loss': 13.7132, 'learning_rate': 4.99879844453183e-05, 'epoch': 0.0}
{'loss': 13.4058, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}
So I think this is not optimizing the loss well. I greatly appreciate having a look. Thanks a lot.
Re: the errors - this is all on master, both the library source code and run_translation.py. When you install with pip install -e ., sometimes conda/pip don't clean up an old install, so it helps to run pip uninstall transformers -y at least 2 times!
I solve such problems by running locally and not relying on the installed transformers, i.e.:
git clone https://github.com/huggingface/transformers
cd transformers
PYTHONPATH=src python examples/seq2seq/run_translation.py ...
Now you never need to worry about which transformers version is installed in the environment.
Re: the loss not going down - this is odd, I just ran your command:
PYTHONPATH=src python examples/seq2seq/run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /tmp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10 --fp16
{'loss': 29.7519, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}
{'loss': 26.3593, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}
{'loss': 23.4431, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}
{'loss': 21.431, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}
{'loss': 19.2445, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}
{'loss': 17.8293, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}
{'loss': 16.9441, 'learning_rate': 4.998470747585966e-05, 'epoch': 0.0}
{'loss': 15.7572, 'learning_rate': 4.99825228295539e-05, 'epoch': 0.0}
{'loss': 15.2937, 'learning_rate': 4.9980338183248135e-05, 'epoch': 0.0}
{'loss': 14.4368, 'learning_rate': 4.997815353694237e-05, 'epoch': 0.0}
{'loss': 14.6709, 'learning_rate': 4.997596889063661e-05, 'epoch': 0.0}
{'loss': 13.2806, 'learning_rate': 4.9973784244330843e-05, 'epoch': 0.0}
{'loss': 12.9245, 'learning_rate': 4.997159959802508e-05, 'epoch': 0.0}
{'loss': 12.4647, 'learning_rate': 4.9969414951719316e-05, 'epoch': 0.0}
{'loss': 11.4738, 'learning_rate': 4.996723030541355e-05, 'epoch': 0.0}
It must be something on your side - your hardware perhaps? Try lowering the learning rate?
I tried with 1 and 2 gpus and it worked in both cases.
Hi @stas00, thank you very much for the pointers. I did as you suggested and now I see the loss going down nicely:
{'loss': 28.1802, 'learning_rate': 4.999890767684712e-05, 'epoch': 0.0}
{'loss': 27.4353, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}
{'loss': 21.3904, 'learning_rate': 4.999672303054136e-05, 'epoch': 0.0}
{'loss': 22.8854, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}
{'loss': 19.6943, 'learning_rate': 4.999453838423559e-05, 'epoch': 0.0}
{'loss': 21.253, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}
{'loss': 20.1937, 'learning_rate': 4.999235373792983e-05, 'epoch': 0.0}
{'loss': 18.6606, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}
{'loss': 18.0337, 'learning_rate': 4.9990169091624065e-05, 'epoch': 0.0}
{'loss': 16.1259, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}
{'loss': 15.4007, 'learning_rate': 4.99879844453183e-05, 'epoch': 0.0}
{'loss': 15.6753, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}
{'loss': 15.0481, 'learning_rate': 4.9985799799012544e-05, 'epoch': 0.0}
{'loss': 14.5833, 'learning_rate': 4.998470747585966e-05, 'epoch': 0.0}
{'loss': 14.0758, 'learning_rate': 4.998361515270678e-05, 'epoch': 0.0}
{'loss': 13.7096, 'learning_rate': 4.99825228295539e-05, 'epoch': 0.0}
{'loss': 13.3216, 'learning_rate': 4.998143050640102e-05, 'epoch': 0.0}
{'loss': 13.2331, 'learning_rate': 4.9980338183248135e-05, 'epoch': 0.0}
{'loss': 12.1556, 'learning_rate': 4.997924586009525e-05, 'epoch': 0.0}
This is such a great, wonderful, amazing fix. Looking forward to using it when this is pushed to the repository. For all the hard problems, you are our only hope, @stas00. Thank you very much for this great fix.
Thank you for your kind words, I'm so happy to hear that it worked, @dorost1234.
I will make a proper PR after I clean this branch up.
@yuvalkirstain, please kindly test if this PR fixes the problem: https://github.com/huggingface/transformers/pull/10956
Thank you @stas00! It seems to work where my proposed fix failed with T5-Small. I will now run some additional experiments with T5-Large and update.
Thank you for validating that, @yuvalkirstain!
Indeed, I first tried local fixes but the problem would just pop up elsewhere.
I'm thinking that perhaps we could find out whether it's all calls to FF that lead to the problem or only some of them, and then optimize the solution I proposed by only disabling autocast in some cases rather than all. I haven't tested that yet.
If you experiment, I recommend trying my branch, since I left the "detector" on and it'll immediately tell you when the first inf is encountered.
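The detector boils down to checks of this shape after each suspect operation (a simplified sketch; the helper name here is illustrative, the branch inlines the equivalent prints):

import torch

def detect_overflow(tensor, name):
    # Report as soon as a tensor contains inf or nan, in the spirit of the
    # messages the debug branch prints (e.g. "T5LayerFF: 1 has inf").
    if torch.isinf(tensor).any():
        print(f"{name} has inf")
    if torch.isnan(tensor).any():
        print(f"{name} has nans")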
What I'm most interested in is some longer runs to ensure it doesn't start overflowing at a later point.
Thank you for your contribution.
Finetuned T5-Base using this branch with the standard T5 finetuning HPs on NQ (except for batch size - used only ~26k tokens) and didn't get nans (it has been running for over 3 hours and training converged). Thanks again, I guess the issue can be closed for the time being.
Thank you for this validation, @yuvalkirstain. I still would like to see if we can find a more efficient solution before merging it, but this is great that we have one that works.
This unfortunately doesn't help with deepspeed, since it doesn't use pytorch AMP but has its own mixed-precision implementation, which doesn't use a context manager and so can't be turned off locally the way autocast can. So we hope to find a different solution.
I linked this issue to the PR so it'll get closed automatically when it's merged.
Well, the nans are back.
T5LayerFF: 1 has inf
T5LayerNorm has inf
T5LayerNorm variance has inf
T5LayerNorm hidden_states has nans
T5LayerNorm hidden_states before return has nans
T5LayerFF: 2 has nans
T5LayerFF: 3 has nans
T5LayerFF: 5 has nans
T5Block after T5LayerFF has nans
T5Stack loop end has nans
T5LayerNorm has nans
T5LayerNorm variance has nans
T5LayerNorm hidden_states has nans
T5LayerNorm hidden_states before return has nans
The model I used here was T5-large-ssm-nqo. @stas00 If you'd like to replicate I can send the relevant training file + command.
Yes, please. I'm working in parallel on gpt-neo, which has the same issues, so the more reproducible cases we have, the higher the chances we can find a solid fix.
Also those would be good candidates for tests (hoping that we can find a quick way to get to overflow).
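The shape of such a test might be roughly this (a sketch only; finding a checkpoint and inputs that actually trigger the overflow quickly is the part that still needs work):

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

def test_t5_fp16_forward_is_finite():
    # Run a single AMP/fp16 forward pass and assert the loss is finite.
    model = T5ForConditionalGeneration.from_pretrained("t5-small").cuda()
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    batch = tokenizer(["translate English to German: hello world"], return_tensors="pt").to("cuda")
    labels = tokenizer(["hallo Welt"], return_tensors="pt").input_ids.cuda()
    with torch.cuda.amp.autocast():
        loss = model(**batch, labels=labels).loss
    assert torch.isfinite(loss).all()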
Let's continue the discussion in the PR that is trying to solve this issue: https://github.com/huggingface/transformers/pull/10956
@dorost1234 Hi, could you please tell me how you solved this loss optimization problem? I am facing the same issue.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
So is this fix now in the main version of transformers?
I found that the results are different when you load like this (the first way is better):
model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True,torch_dtype=torch.float16).to("cuda")
than when you load via:
model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True)
model1a_CPU.half()
model1a_CPU.eval()
model1a_CPU.to("cuda")
So this could be a solution; I will compare results on CPU versus this versus .half().
It seems like the solution is already implemented in this call: model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True, torch_dtype=torch.float16).to("cuda")
Probably it is triggered by torch_dtype=torch.float16, so part of the model is (likely) moved from fp16 to fp32 and it works properly - exactly the same as with fp32 and exactly the same as on CPU.
Of course it does use a little more memory: with the second way the memory usage is around 2.5 GB for T5-large, while with the first it is around 2.9 GB. It is also around 10-15 percent slower.
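One way to check that hypothesis is to look at the parameter dtypes after loading (a minimal sketch reusing the model1a_CPU name from above):

import torch

# list any parameters that did not end up in fp16 after loading with
# torch_dtype=torch.float16; an empty listing would disprove the fp32 theory
for name, param in model1a_CPU.named_parameters():
    if param.dtype != torch.float16:
        print(name, param.dtype)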
Environment info
transformers version: 4.5.0.dev0
Who can help
@patil-suraj @patrickvonplaten
Information
Model I am using (Bert, XLNet ...): t5-large
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior (the fix I'm suggesting is very simple, so perhaps there is no reason to reproduce):
Expected behavior
Training without nans.
Possible fix
I debugged and saw that we get nans in the modeling_t5.py script at line 241. By modifying this line, it seems to be solved.
BTW it happens in the last layers (this might explain why it wasn't caught in this fix)
seq2seq.zip