huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

FP16 overflow with GPT-Neo when using sequence lengths of 2048. #11076

Closed · LouisCastricato closed this issue 3 years ago

LouisCastricato commented 3 years ago

Environment info

Who can help

@stas00

Models:

Library:

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

The tasks I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. Use GPT-Neo 1.3B with The Pile dataset and the built-in Trainer. Artificial data also suffices; it does not matter what the data is, as long as the attention mask spans all 2048 tokens.
  2. Enable FP16 and set max_length to 2048.
  3. Observe that all losses reported are NaN.

Also reproducible using AMP or DeepSpeed. There appears to be code in the GPT-Neo implementation intended to circumvent this, where q, k, v are cast to fp32 in the attention block.

When max_length is shorter (e.g. 512), this overflow does not occur.
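A minimal sketch of the kind of script that reproduces this (artificial data and hypothetical hyperparameters, not the exact training script):

import torch
from torch.utils.data import Dataset
from transformers import GPTNeoForCausalLM, Trainer, TrainingArguments

class RandomTokens(Dataset):
    # artificial data: random token ids with an attention mask spanning all 2048 tokens
    def __init__(self, vocab_size=50257, seq_len=2048, n=64):
        self.vocab_size, self.seq_len, self.n = vocab_size, seq_len, n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        ids = torch.randint(0, self.vocab_size, (self.seq_len,))
        return {"input_ids": ids, "attention_mask": torch.ones_like(ids), "labels": ids}

model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
args = TrainingArguments(
    output_dir="out",               # placeholder
    fp16=True,                      # with fp16 enabled and seq_len 2048, logged losses come out as NaN
    per_device_train_batch_size=1,
    max_steps=10,
    logging_steps=1,
)
Trainer(model=model, args=args, train_dataset=RandomTokens()).train()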

Expected behavior

I expected no overflows.

Aside

I'm reaching out on behalf of EleutherAI; Lysandre told us to create an issue about this.

LouisCastricato commented 3 years ago

> OP is asking to support bf16 training, but you're asking for fp16 training. These two are significantly different issues.
>
> It'd be awesome for deepspeed to support bf16, but this is not going to help users w/o hardware natively supporting bf16.

I meant that the changes they recommended making could also help resolve our FP16 issues. They outlined what would need to be changed for bf16.

stas00 commented 3 years ago

Currently the very first problem is deepspeed calling model.half(), which leads to immediate underflow in model weights. As I have shown above:

import torch

torch.tensor(2.32e-11).to(dtype=torch.float16)
# tensor(0., dtype=torch.float16)
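For context, a value like that is far below anything fp16 can represent; the fp16 range can be checked directly:

import torch

info = torch.finfo(torch.float16)
print(info.tiny)  # ~6.1e-05  -- smallest normal fp16 value
print(info.max)   # 65504.0   -- largest fp16 value
# 2.32e-11 is even below the smallest fp16 subnormal (~6e-08), so it flushes to zero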

Therefore I can't see how any of the suggestions directed to support bf16 training would help in this case.

Chances are that deepspeed will need a new mode which is not all-fp16, one that does the fp16 conversion only when it's safe to do so and scales the weights and activations up/down when they fall in a range that is unsafe for fp16. So it won't be as slow / memory-demanding as a full fp32 mode, but it won't be normal fp16 mixed precision either.
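As a toy illustration of the scaling idea (not DeepSpeed code, just the principle):

import torch

w = torch.full((4,), 2.32e-11)                 # fp32 weights that underflow in fp16
scale = w.abs().max().clamp(min=1e-30)         # per-tensor scale kept in fp32
w_fp16 = (w / scale).to(torch.float16)         # scaled up into a safe fp16 range
w_restored = w_fp16.to(torch.float32) * scale  # scaled back down in fp32
print(w_fp16)      # non-zero values
print(w_restored)  # approximately the original weights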

stas00 commented 3 years ago

OK, please have a look at the current setup on your instance, try:

PYTHONPATH=~/DeepSpeed deepspeed --num_gpus 1 distill.py --deepspeed_config ds_config_zero3.json --debug

~/DeepSpeed currently contains an experimental branch by @samyam (https://github.com/microsoft/DeepSpeed/tree/samyamr/full-precision-for-stage3), who created a semi-fp32 deepspeed mode that, according to him, should be only 2x slower than normal fp16 mixed precision for bfloat16-pretrained models, but much faster than fp32.

It also currently requires a hardcoded change:

diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index 82a0a9917..9a23bc55b 100755
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -1085,7 +1085,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):

             logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")
             # this immediately partitions the model to avoid the overhead in time and memory copying it on CPU or each GPU first
-            with deepspeed.zero.Init():
+            with deepspeed.zero.Init(dtype=torch.float):
                 model = cls(config, *model_args, **model_kwargs)
         else:
             model = cls(config, *model_args, **model_kwargs)

which is already applied under ~/transformers-stas/, and fp16 is set to false in the ds config files, i.e. it is all already set up for you.
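For reference, the relevant part of those ds config files looks roughly like this (a sketch, not the full ds_config_zero3.json; with the HF integration such a dict can also be passed via TrainingArguments(deepspeed=...)):

ds_config = {
    "fp16": {"enabled": False},          # fp16 disabled, as noted above
    "zero_optimization": {"stage": 3},   # ZeRO stage 3
}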

So this is zero3; zero2 still needs some work.

stas00 commented 3 years ago

But looking closer at your code, I see now that we have been trying to solve the wrong problem all along.

Why is your code using "EleutherAI/gpt-neo-2.7B", when one of you said earlier it was pre-trained in full fp32? How could you possibly expect it to train or eval in fp16? Or did you just want deepspeed in fp32 mode? Please clarify.

One of you said it's the 1.3B checkpoint that was trained in bf16.

stas00 commented 3 years ago

OK, zero2 now works too.

PYTHONPATH=~/DeepSpeed deepspeed --num_gpus 1 distill.py --deepspeed_config ds_config.json --debug

So Samyam explained that this new deepspeed branch enables full FP32 mode.

But since your setup is running on an A100, pytorch uses TF32, so you're getting a speed equivalent to fp16 on a V100.

RTX-3090 should also be able to get this performance.
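(As a side note, a sketch of how TF32 use is controlled on the PyTorch side; these flags defaulted to enabled on Ampere in the PyTorch versions current at the time:)

import torch

# TF32 for fp32 matmuls / cudnn convolutions on Ampere GPUs (A100, RTX-3090)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True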

All kudos go to @samyam.

leogao2 commented 3 years ago

> But looking closer at your code, I see now that we have been trying to solve the wrong problem all along.
>
> Why is your code using "EleutherAI/gpt-neo-2.7B", when one of you said earlier it was pre-trained in full fp32? How could you possibly expect it to train or eval in fp16? Or did you just want deepspeed in fp32 mode? Please clarify.
>
> One of you said it's the 1.3B checkpoint that was trained in bf16.

We've been having the NaN issue with both the bf16 1.3B checkpoint and the fp32 2.7B checkpoint; we were under the assumption that, since both have the same dynamic range, both would have the same under/overflow problems. I'm also pretty sure that the bf16 1.3B checkpoint was trained with bf16 activations and fp32 master weights that were then quantized to bf16 (the quantization was a mistake by one of our devs).

Our main problem is that with fp32, 1.3B, and no deepspeed we can't even fit a single full batch without OOM, and we can't turn on any deepspeed optimizations without fp16 being on (interestingly, the OOM doesn't seem to happen with Samyam's branch). Of course, we would like to train our model in mixed precision (using fp32 for the parts that are underflowing) for the obvious memory savings, so we thought it would be much easier to just make our model work with mixed precision, and get those memory savings, than to make deepspeed work with fp32. We would also be fine with making deepspeed work with fp32 or bf16 if that's significantly easier.

Thanks for all your time in helping us with this issue.

stas00 commented 3 years ago

In general, if you want users to be able to use fp16 mixed precision for fine-tuning and inference, you need to pre-train the model in this mode. For some models we find workarounds that localize the switch to fp32 to the specific submodules that underflow/overflow under fp16, but users often still get NaNs during long trainings.
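The typical shape of such a workaround is to run just the overflow-prone math in fp32 and cast back (an illustrative sketch, not the actual GPT-Neo code):

import torch

def attn_probs_fp32(q, k):
    # do the attention score computation in fp32, then return to the input dtype
    orig_dtype = q.dtype
    scores = torch.matmul(q.float(), k.float().transpose(-1, -2))
    probs = torch.nn.functional.softmax(scores, dim=-1)
    return probs.to(orig_dtype)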

Bottom line: if you pre-train in bf16, be prepared to tell users to use fp32 or bf16 in their fine-tuning/inference processes. As new hardware supporting the bf16/tf32 formats emerges (rtx-3090 + a100), this will become the simple go-to solution in the future.

Now that deepspeed will have a full-fp32 mode this is great.

So to summarize, at this moment with Samyam's branch if you use:

LouisCastricato commented 3 years ago

How would one use this special fp32 mode without zero?

stas00 commented 3 years ago

You mean w/o deepspeed (or fairscale)?

Just don't enable mixed precision in training, i.e. in transformers don't use --fp16 for training and don't use --fp16_full_eval for eval.
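In code that is simply the default behaviour, e.g. (output_dir is a placeholder):

from transformers import TrainingArguments

# fp16 and fp16_full_eval both default to False, so plain fp32 is what you get out of the box
args = TrainingArguments(output_dir="out", fp16=False, fp16_full_eval=False)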

Unless you're asking how to use deepspeed w/o ZeRO - why would you want to do that? ZeRO is the core of deepspeed, and if you're not using it, you don't really need deepspeed.

If I misunderstood your question please clarify.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

stas00 commented 2 years ago

@LouisCastricato

> After checking our internal documents we realized that 1.3b is bfp16 where as 2.7b is fp32

You wrote: bfp16

Did you mean to write fp16 or bf16?

According to the detector tool I'm working on it is most likely fp16. It'd be super helpful if you could check on how it was trained. Thank you!

If you have other published model checkpoints and their dtype that would be very helpful too, as I'm trying to gather that information.

stas00 commented 2 years ago

Talked to Stella and she confirmed Louis meant to write bf16 for 1.3B model.