huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

getting nans with t5-large + fix #10830

Closed yuvalkirstain closed 3 years ago

yuvalkirstain commented 3 years ago

Environment info

Who can help

@patil-suraj @patrickvonplaten

Information

Model I am using (Bert, XLNet ...): t5-large

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior (the fix I'm suggesting is very simple, so perhaps there is no reason to reproduce):

  1. unzip the attached zip (below).
  2. run
    python run_seq2seq.py --model_name_or_path=t5-large
    --do_train
    --do_eval
    --task=qa
    --train_file=data/PAQ.filtered.regular.16000.json
    --validation_file=data/PAQ.filtered.regular.16000.json
    --output_dir=results/5e-5-t5-large-4096000-128-140-1792000-0.1-regular-true-4
    --overwrite_output_dir
    --per_device_train_batch_size=1
    --per_device_eval_batch_size=128
    --predict_with_generate
    --fp16
    --max_steps=1000
    --evaluation_strategy=steps
    --text_column=question
    --summary_column=answer
    --save_total_limit=5
    --cache_dir=../.cache
    --save_steps=500000
    --learning_rate=5e-5
    --eval_steps=96000
    --warmup_steps=100
    --run_name=5e-5-t5-large-4096000-128-140-1792000-0.1-regular-true-4
    --dropout_rate=0.1
    --gradient_accumulation_steps=1
    --logging_steps=1

Expected behavior

Training without nans.

Possible fix

I debugged and saw that we get nans in the modeling_t5.py script at line 241:

hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)

By modifying this line to:

clamp_value = torch.finfo(hidden_states.dtype).max - 1000
hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) * torch.rsqrt(variance + self.variance_epsilon)

With this change, the nans seem to be resolved.

BTW, it happens in the last layers (this might explain why it wasn't caught by this fix).
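For readers who want to see the change in context, here is a minimal sketch of how the proposed clamp slots into a T5-style RMS layer norm (the class skeleton below is a simplification and may not match modeling_t5.py line for line):

import torch
from torch import nn

class T5LayerNorm(nn.Module):
    """T5-style layer norm: RMS scaling only, no mean subtraction and no bias."""

    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        # mean of squares over the hidden dimension
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        # proposed fix: clamp activations just below the dtype max so the
        # rsqrt scaling cannot overflow to inf in fp16
        clamp_value = torch.finfo(hidden_states.dtype).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value) * torch.rsqrt(
            variance + self.variance_epsilon
        )
        return self.weight * hidden_states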

seq2seq.zip

dorost1234 commented 3 years ago

Hi, I also observe a similar issue with mt5 models (https://github.com/huggingface/transformers/issues/10819); deepspeed is still not working for me due to this issue with mt5 models. I would greatly appreciate you having a look @patil-suraj @patrickvonplaten

patrickvonplaten commented 3 years ago

We didn't really manage to resolve the problems with t5/mt5 + mixed precision fp16 (cc @patil-suraj). I'm not sure whether anybody has tried internally to fine-tune t5/mt5 with deepspeed (@stas00 maybe?)

dorost1234 commented 3 years ago

The issue arises without deepspeed, with just the vanilla mt5-small model. I also see similar nans with deepspeed using a slightly modified model based on mt5-small; please see the issue here: https://github.com/huggingface/transformers/issues/10821#issuecomment-803453998. I think that if the issue with the fp16 option could be resolved, this would hopefully also be more stable with model changes in deepspeed. Thanks a lot.

stas00 commented 3 years ago

Indeed, this has nothing to do with deepspeed, other than that deepspeed trains in mixed precision and evals in full fp16 at the moment.

I've started studying the bfloat16 vs. float16 numerical properties and how they correlate to each other. And once I understand it well I will try to see if there is some sort of magical remapping that perhaps could be done - this is my fantasy of course. I just need to finish a few other more urgent things with deepspeed stage3 integration first.
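(As a quick illustration of why float16 is so much more overflow-prone than bfloat16 - this snippet is just torch.finfo output, not code from any branch:)

import torch

# bfloat16 keeps the float32 exponent range (max ~3.4e38) at the cost of precision,
# while float16 overflows past ~65504 - which is why activations from a bf16-pretrained
# model can turn into inf once run under fp16.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16} max={info.max:.3e}  tiny={info.tiny:.3e}  eps={info.eps:.3e}")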

But please don't let my comment prevent you from merging the proposed fix if it already solves the problem.

dorooddorood606 commented 3 years ago

I got a similar issue with the mt5 model. @patrickvonplaten, thanks a lot in advance for your help.

stas00 commented 3 years ago

@dorost1234 + @yuvalkirstain, please kindly try this branch: https://github.com/huggingface/transformers/tree/t5-fp16-no-nans and let me know if it solves the problem - it seems that the problem is due to autocast in T5LayerFF, so this branch tries to turn off autocast just for that layer. It also disables the previously added clamping.

There are also a lot of debug statements in the branch, but they will be silent unless nan/inf is detected.

I tested that it works on a small sample with t5-small/t5-base/t5-large/google/mt5-small.

The main part of the fix is just:

class T5LayerFF(nn.Module):
    def forward(self, hidden_states):
        with torch.cuda.amp.autocast(enabled=False):
            forwarded_states = self.layer_norm(hidden_states)
            forwarded_states = self.DenseReluDense(forwarded_states)
            hidden_states = hidden_states + self.dropout(forwarded_states)
        return hidden_states

and removing some code. So use the branch first.

If it works I guess we could just monkey patch this version in for AMP or come up with some cleaner solution, probably with a torch.is_autocast_enabled() check.
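For illustration, a possible shape for that kind of monkey patch (hypothetical sketch, not the branch code; it assumes the transformers.models.t5.modeling_t5 module layout):

import torch
from transformers.models.t5 import modeling_t5

_orig_ff_forward = modeling_t5.T5LayerFF.forward

def _ff_forward_fp32_under_amp(self, hidden_states):
    # only interfere when AMP autocast is actually active
    if torch.is_autocast_enabled():
        with torch.cuda.amp.autocast(enabled=False):
            # run the whole FF sublayer in fp32, then cast the result back
            return _orig_ff_forward(self, hidden_states.float()).type_as(hidden_states)
    return _orig_ff_forward(self, hidden_states)

modeling_t5.T5LayerFF.forward = _ff_forward_fp32_under_amp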

dorost1234 commented 3 years ago

Dear @stas00, thank you very much for taking the time to look into this issue - it would be really awesome if this could fix it. To test it, I checked out the branch, installed it locally with "python setup.py develop", and then ran this command:

python run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /temp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10 --fp16

I got this error:

Traceback (most recent call last):
  File "run_translation.py", line 562, in <module>
    main()
  File "run_translation.py", line 448, in main
    pad_to_multiple_of=8 if training_args.fp16 else None,
TypeError: __init__() got an unexpected keyword argument 'model'

I think there is some version mismatch. I removed the model argument from the collator call, as below:

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    # model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8 if training_args.fp16 else None,
)

and then here is what I got with the fp16 option:

{'loss': 23.3523, 'learning_rate': 4.999890767684712e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 22.5557, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 25.9471, 'learning_rate': 4.999672303054136e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 23.0994, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 24.9974, 'learning_rate': 4.999453838423559e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 23.3743, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 24.2147, 'learning_rate': 4.999235373792983e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 26.7845, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 25.2277, 'learning_rate': 4.9990169091624065e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 23.3156, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 21.275, 'learning_rate': 4.99879844453183e-05, 'epoch': 0.0}                                                                                                                         
{'loss': 23.7031, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 23.8086, 'learning_rate': 4.9985799799012544e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 25.8143, 'learning_rate': 4.998470747585966e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 24.4319, 'learning_rate': 4.998361515270678e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 26.8277, 'learning_rate': 4.99825228295539e-05, 'epoch': 0.0} 

here is the loss without fp16:

{'loss': 27.0258, 'learning_rate': 4.999890767684712e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 23.141, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}                                                                                                                        
{'loss': 21.2312, 'learning_rate': 4.999672303054136e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 19.3567, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 18.7998, 'learning_rate': 4.999453838423559e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 17.9632, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 17.2105, 'learning_rate': 4.999235373792983e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 17.5506, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 15.2566, 'learning_rate': 4.9990169091624065e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 14.8667, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 13.7132, 'learning_rate': 4.99879844453183e-05, 'epoch': 0.0}                                                                                                                        
{'loss': 13.4058, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}

So I think this is not optimizing the loss well. I greatly appreciate having a look. Thanks a lot.

stas00 commented 3 years ago

Re: errors - this is all on master - both the source code and run_translation.py. When you install with pip install -e ., sometimes conda/pip don't clean up an old install, so it helps to run pip uninstall transformers -y at least 2 times!

I solve such problems by running locally and not relying on the installed transformers, i.e.:

git clone https://github.com/huggingface/transformers
cd transformers
PYTHONPATH=src python examples/seq2seq/run_translation.py ...

now you never need to worry about what transformers version is installed in the environment.

wrt the loss not going down - this is odd, I just ran your code:

PYTHONPATH=src python examples/seq2seq/run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /tmp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10 --fp16

{'loss': 29.7519, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}                                                                                           
{'loss': 26.3593, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}                                                                                          
{'loss': 23.4431, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}                                                                                           
{'loss': 21.431, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}                                                                                           
{'loss': 19.2445, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}                                                                                           
{'loss': 17.8293, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}                                                                                          
{'loss': 16.9441, 'learning_rate': 4.998470747585966e-05, 'epoch': 0.0}                                                                                           
{'loss': 15.7572, 'learning_rate': 4.99825228295539e-05, 'epoch': 0.0}
{'loss': 15.2937, 'learning_rate': 4.9980338183248135e-05, 'epoch': 0.0}
{'loss': 14.4368, 'learning_rate': 4.997815353694237e-05, 'epoch': 0.0}
{'loss': 14.6709, 'learning_rate': 4.997596889063661e-05, 'epoch': 0.0}
{'loss': 13.2806, 'learning_rate': 4.9973784244330843e-05, 'epoch': 0.0}
{'loss': 12.9245, 'learning_rate': 4.997159959802508e-05, 'epoch': 0.0}
{'loss': 12.4647, 'learning_rate': 4.9969414951719316e-05, 'epoch': 0.0}
{'loss': 11.4738, 'learning_rate': 4.996723030541355e-05, 'epoch': 0.0}

Must be your hardware? Try to lower the learning rate?

I tried with 1 or 2 gpus and it worked in both cases.

dorost1234 commented 3 years ago

Hi @stas00, thank you very much for the pointers. I did it as you mentioned and now I see the loss is going down nicely:

{'loss': 28.1802, 'learning_rate': 4.999890767684712e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 27.4353, 'learning_rate': 4.999781535369424e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 21.3904, 'learning_rate': 4.999672303054136e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 22.8854, 'learning_rate': 4.9995630707388475e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 19.6943, 'learning_rate': 4.999453838423559e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 21.253, 'learning_rate': 4.999344606108271e-05, 'epoch': 0.0}                                                                                                                        
{'loss': 20.1937, 'learning_rate': 4.999235373792983e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 18.6606, 'learning_rate': 4.9991261414776954e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 18.0337, 'learning_rate': 4.9990169091624065e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 16.1259, 'learning_rate': 4.998907676847119e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 15.4007, 'learning_rate': 4.99879844453183e-05, 'epoch': 0.0}                                                                                                                        
{'loss': 15.6753, 'learning_rate': 4.9986892122165426e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 15.0481, 'learning_rate': 4.9985799799012544e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 14.5833, 'learning_rate': 4.998470747585966e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 14.0758, 'learning_rate': 4.998361515270678e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 13.7096, 'learning_rate': 4.99825228295539e-05, 'epoch': 0.0}                                                                                                                        
{'loss': 13.3216, 'learning_rate': 4.998143050640102e-05, 'epoch': 0.0}                                                                                                                       
{'loss': 13.2331, 'learning_rate': 4.9980338183248135e-05, 'epoch': 0.0}                                                                                                                      
{'loss': 12.1556, 'learning_rate': 4.997924586009525e-05, 'epoch': 0.0} 

This is such a great, wonderful, amazing fix. Looking forward to using it when this is pushed to the repository. For all the hard problems, you are our only hope, @stas00. Thank you very much for this great fix.

stas00 commented 3 years ago

Thank you for your kind words, I'm so happy to hear that it worked, @dorost1234.

I will make a proper PR after I clean this branch up.

stas00 commented 3 years ago

@yuvalkirstain, please kindly test if this PR fixes the problem: https://github.com/huggingface/transformers/pull/10956

yuvalkirstain commented 3 years ago

Thank you @stas00! It seems to work where my proposed fix failed with T5-Small. I will now run some additional experiments with T5-Large and update.

stas00 commented 3 years ago

Thank you for validating that, @yuvalkirstain!

Indeed, I first tried local fixes, but the problem would just pop up elsewhere.

I'm just thinking that perhaps we could find if it's all calls to FF that lead to the problem or only some of them, and then we could optimize the solution I proposed by only disabling autocast in some cases and not all. I haven't tested that yet.

If you experiment, I recommend you try my branch, since I left the "detector" on and it'll immediately tell you when the first inf is encountered (a rough equivalent is sketched at the end of this comment).

What I'm most interested in is some longer runs to ensure it doesn't start overflowing at a later point.

Thank you for your contribution.
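(For anyone who wants a similar "detector" outside that branch, here is a rough equivalent using forward hooks - hypothetical code, not what the branch actually does:)

import torch

def attach_overflow_detector(model):
    """Print the modules whose outputs contain inf/nan during a forward pass."""

    def make_hook(name):
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outputs:
                if torch.is_tensor(t) and t.is_floating_point() and not torch.isfinite(t).all():
                    print(f"inf/nan detected in the output of {name} ({module.__class__.__name__})")
                    break
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))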

yuvalkirstain commented 3 years ago

Fine-tuned T5-Base using this branch with the standard T5 fine-tuning HPs on NQ (except for batch_size - used only ~26k tokens) and didn't get nans (it has been running for over 3 hours and training converged). Thanks again, I guess the issue can be closed for the time being.

stas00 commented 3 years ago

Thank you for this validation, @yuvalkirstain. I still would like to see if we can find a more efficient solution before merging it, but this is great that we have one that works.

This unfortunately doesn't help with deepspeed, since deepspeed doesn't use pytorch AMP but has its own version, which doesn't use a context manager and so can't be turned off locally the way autocast can. So we hope to find a different solution there.

I linked this issue to the PR so it'll get closed automatically when it's merged.

yuvalkirstain commented 3 years ago

Well, the nans are back.

T5LayerFF: 1 has inf
T5LayerNorm has inf
T5LayerNorm variance has inf
T5LayerNorm hidden_states has nans
T5LayerNorm hidden_states before return has nans
T5LayerFF: 2 has nans
T5LayerFF: 3 has nans
T5LayerFF: 5 has nans
T5Block after T5LayerFF has nans
T5Stack loop end has nans
T5LayerNorm has nans
T5LayerNorm variance has nans
T5LayerNorm hidden_states has nans
T5LayerNorm hidden_states before return has nans

The model I used here was T5-large-ssm-nqo. @stas00 If you'd like to replicate I can send the relevant training file + command.

stas00 commented 3 years ago

Yes, please. I'm working in parallel on gpt-neo, which has the same issues, so the more reproducible cases we have, the higher the chances we can find a solid fix.

Also those would be good candidates for tests (hoping that we can find a quick way to get to overflow).

stas00 commented 3 years ago

Let's continue the discussion in the PR that is trying to solve this issue: https://github.com/huggingface/transformers/pull/10956

Sahajtomar commented 3 years ago

@dorost1234 Hi, could you please tell me how you solved this loss optimization problem? I am facing the same issue.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Oxi84 commented 1 year ago

So is this fix now in the main version of transformers?

Oxi84 commented 1 year ago

I found that the results are different when you load like this (the first is better):

model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True,torch_dtype=torch.float16).to("cuda")

than when you load via:

model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True)
model1a_CPU.half()
model1a_CPU.eval()
model1a_CPU.to("cuda")

So this could be a solution; I will compare the results on CPU versus this versus half().

Oxi84 commented 1 year ago

Seems like the solution is already implemented in this call: model1a_CPU = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True, torch_dtype=torch.float16).to("cuda")

Probably it is triggered by torch_dtype=torch.float16. So a part of the model is (likely) moved from fp16 to fp32, so it works properly - exactly the same as with fp32, and exactly the same as on the CPU.

Of course it does use a little bit more memory. When you call it the second way, the memory usage is around 2.5 GB for T5-large, while with the first it is around 2.9 GB. It is also around 10-15 percent slower.
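If you want to verify what each loading path actually produces, a quick diagnostic is to count the parameter dtypes (a sketch; best_model_path stands in for whatever checkpoint you are loading):

from collections import Counter

import torch
from transformers import T5ForConditionalGeneration

best_model_path = "t5-large"  # placeholder for your fine-tuned checkpoint

# path 1: let from_pretrained pick the dtypes (may keep some modules in fp32)
model_a = T5ForConditionalGeneration.from_pretrained(
    best_model_path, low_cpu_mem_usage=True, torch_dtype=torch.float16
)
# path 2: load in fp32, then downcast every parameter with .half()
model_b = T5ForConditionalGeneration.from_pretrained(best_model_path, low_cpu_mem_usage=True).half()

for label, model in (("torch_dtype=float16", model_a), (".half()", model_b)):
    counts = Counter(str(p.dtype) for p in model.parameters())
    print(label, dict(counts))  # shows whether any parameters were kept in float32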