Closed LouisCastricato closed 3 years ago
OP is asking to support bf16 training, but you're asking for fp16 training. These two are significantly different issues.
It'd be awesome for deepspeed to support bf16, but this is not going to help users w/o hardware natively supporting bf16.
I meant the changes they recommended making could also help resolve our FP16 issues. They outlined what would need to be changed for bf16
Currently the very first problem is deepspeed calling model.half(), which leads to immediate underflow in model weights. As I have shown above:
torch.tensor(2.32e-11).to(dtype=torch.float16)
# tensor(0., dtype=torch.float16)
Therefore I can't see how any of the suggestions directed to support bf16 training would help in this case.
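The underflow can also be reproduced without torch or a GPU. The following is only a sketch: it uses the standard library's half-precision `struct` format in place of `torch.float16`, and it emulates bf16 by truncating the float32 bit pattern (real hardware rounds to nearest). The helper names `to_fp16` and `to_bf16` are made up for this illustration and are not deepspeed or pytorch APIs.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision and convert it back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

    Assumption: plain truncation, which is cruder than hardware rounding but
    keeps the same 8-bit exponent, so the representable range matches bf16.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


small = 2.32e-11  # the weight value used in the torch snippet above
fp16_value = to_fp16(small)  # below the smallest fp16 subnormal (2**-24)
bf16_value = to_bf16(small)  # well inside the bf16 range

print("fp16:", fp16_value)
print("bf16:", bf16_value)
```

The fp16 result collapses to zero while the emulated bf16 value stays close to the original, which is why a model whose weights are fine in bf16 can break right after model.half().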
Chances are that deepspeed will need a new mode, which is not all-fp16: it would do the fp16 conversion only when it's safe to do so, and scale the weights and activations up/down when they fall outside the safe fp16 range. So it won't be as slow / memory demanding as a full fp32 mode, but it won't be a normal fp16 mixed precision either.
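One possible reading of "safe to do so" is a range check before converting. The sketch below is hypothetical and is not how deepspeed works: `fp16_safe` is an invented helper, and treating anything below the smallest normal fp16 value as unsafe is an assumption made for this example.

```python
FP16_MAX = 65504.0            # largest finite fp16 value
FP16_MIN_NORMAL = 2.0 ** -14  # smallest normal fp16 value (about 6.1e-5)


def fp16_safe(values) -> bool:
    """Return True if every nonzero value lies inside the normal fp16 range."""
    magnitudes = [abs(v) for v in values if v != 0.0]
    if not magnitudes:
        return True
    return max(magnitudes) <= FP16_MAX and min(magnitudes) >= FP16_MIN_NORMAL


weights = [2.32e-11, 1.5e-10]
scale = 2.0 ** 24  # a power-of-two scale factor, chosen by hand for this example

print(fp16_safe(weights))                       # too small for fp16 as-is
print(fp16_safe([w * scale for w in weights]))  # in range once scaled up
```

A real implementation would also have to undo the scaling at the right places, which is the part that makes such a mode harder than ordinary mixed precision.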
OK, please have a look at the current setup on your instance, try:
PYTHONPATH=~/DeepSpeed deepspeed --num_gpus 1 distill.py --deepspeed_config ds_config_zero3.json --debug
~/DeepSpeed currently contains an experimental branch by @samyam (https://github.com/microsoft/DeepSpeed/tree/samyamr/full-precision-for-stage3), who created a semi-fp32 deepspeed mode that according to him should be only 2x slower than normal mixed precision fp16 for bfloat16-pretrained models, but much faster than fp32.
It also currently requires a hardcoded change:
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index 82a0a9917..9a23bc55b 100755
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -1085,7 +1085,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin):
logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")
# this immediately partitions the model to avoid the overhead in time and memory copying it on CPU or each GPU first
- with deepspeed.zero.Init():
+ with deepspeed.zero.Init(dtype=torch.float):
model = cls(config, *model_args, **model_kwargs)
else:
model = cls(config, *model_args, **model_kwargs)
which is already applied under ~/transformers-stas/, and fp16 is also set to false in the ds config files, i.e. it is all already set up for you.
So this is zero3. zero2 still needs some work.
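For reference, here is a minimal sketch of the relevant part of such a ds config, built as a Python dict and printed as JSON. Only the `fp16.enabled` and `zero_optimization.stage` entries reflect what is discussed here; the batch size value is a placeholder, and the actual ds_config_zero3.json on the instance may contain more settings.

```python
import json

# Assumption: a stripped-down config for illustration, not the real file.
ds_config = {
    "fp16": {"enabled": False},          # keep weights in fp32, no model.half()
    "zero_optimization": {"stage": 3},   # ZeRO stage 3, as in ds_config_zero3.json
    "train_micro_batch_size_per_gpu": 1, # placeholder value
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```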
But looking closer at your code, I see now that we have been trying to solve the wrong problem all along.
Why is your code using "EleutherAI/gpt-neo-2.7B", when one of you said earlier that it was pre-trained in full fp32? How could you possibly expect it to train or eval in fp16? Or did you just want deepspeed in fp32 mode? Please clarify.
One of you said it's the 1.3B checkpoint that was trained in bf16.
OK, zero2 now works too.
PYTHONPATH=~/DeepSpeed deepspeed --num_gpus 1 distill.py --deepspeed_config ds_config.json --debug
So Samyam explained that this new deepspeed branch enables full FP32 mode.
But since your setup is running on A100, pytorch uses TF32, so you're getting an equivalent speed to fp16 on V100.
RTX-3090 should also be able to get this performance.
All kudos go to @samyam.
But looking closer at your code, I see now that we have been trying to solve the wrong problem all along.
Why is your code using "EleutherAI/gpt-neo-2.7B", when one of you said earlier that it was pre-trained in full fp32? How could you possibly expect it to train or eval in fp16? Or did you just want deepspeed in fp32 mode? Please clarify.
One of you said it's the 1.3B checkpoint that was trained in bf16.
We've been having the nan issue with both the bf16 1.3B checkpoint and the fp32 2.7B checkpoint; we were under the assumption that as both have the same dynamic range, both would have the same under/overflow problems. I'm also pretty sure that the bf16 1.3B checkpoint was trained with bf16 activations with fp32 master weights quantized to bf16 (the quantization was a mistake by one of our devs).
Our main problem is that with fp32, 1.3B, and no deepspeed, we can't even fit a single full batch without OOM, and we can't turn on any deepspeed optimizations without fp16 being on (interestingly, it seems the OOM doesn't happen with Samyam's branch). Of course, we would like to train our model using mixed-precision (using fp32 for the parts that are underflowing) for the obvious memory savings, so we thought it would be much easier to just make our model work with mixed-precision and also get those memory savings than to make deepspeed work with fp32. We would also be fine with making deepspeed work with fp32 or bf16 if it's significantly easier.
Thanks for all your time in helping us with this issue.
In general if you want users to be able to use fp16 mixed precision for fine-tuning and inference you need to pre-train the model using this mode. For some models we find certain workarounds that localize switching to fp32 for specific submodules, that lead to underflow/overflow under fp16, but often users still get NaNs during long training.
Bottom line, if you pre-train in bf16 be prepared to tell users to use fp32 or bf16 in their fine-tuning/inference processes. As the new hardware supporting bf16/tf32 formats emerges (rtx-3090 + a100) this will become the simple go-to solution in the future.
Now that deepspeed will have a full-fp32 mode this is great.
So to summarize, at this moment with Samyam's branch:
fp16.enabled=false is needed in the ds config
zero.Init(dtype=torch.float) is needed in modeling_utils.py (instead of just zero.Init()) - I need to think how to make that configurable.
How would one use this special fp32 mode without zero?
You mean w/o deepspeed (or fairscale)?
Just don't enable mixed precision in the training, i.e. in transformers don't use --fp16 in train and don't use --fp16_full_eval in eval.
Unless you ask how to use deepspeed w/o zero - why would you want to do that? ZeRO is the core of deepspeed and if you are not using it, you don't really need deepspeed.
If I misunderstood your question please clarify.
@LouisCastricato
After checking our internal documents we realized that 1.3b is bfp16 where as 2.7b is fp32
You wrote: bfp16
Did you mean to write fp16 or bf16?
According to the detector tool I'm working on it is most likely fp16. It'd be super helpful if you could check on how it was trained. Thank you!
If you have other published model checkpoints and their dtype that would be very helpful too, as I'm trying to gather that information.
Talked to Stella and she confirmed Louis meant to write bf16 for 1.3B model.
Environment info
transformers version: 4.5.0.dev0
Who can help
@stas00
To reproduce
Steps to reproduce the behavior:
Also reproducible using AMP or DeepSpeed. It seems the GPT-Neo implementation already contains code meant to circumvent this, where q, k, v are cast to fp32 in the attention block.
When the max_length is shorter (512) this overflow does not occur.
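To illustrate why casting q, k, v to a wider type before the attention matmul can matter, here is a sketch that emulates an fp16 dot product with the standard library. It is not the GPT-Neo code: the helper names are invented, the input values are made up to force an overflow, and plain Python floats stand in for fp32.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to IEEE half precision; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def dot_fp16(q, k) -> float:
    """Dot product where every product and partial sum is stored in fp16."""
    acc = 0.0
    for a, b in zip(q, k):
        acc = to_fp16(acc + to_fp16(a * b))
    return acc


def dot_wide(q, k) -> float:
    """Same dot product accumulated in Python floats (stand-in for fp32)."""
    return sum(a * b for a, b in zip(q, k))


# Made-up activations: 128 dimensions with a magnitude of 40 each.
q = [40.0] * 128
k = [40.0] * 128

score_fp16 = dot_fp16(q, k)
score_wide = dot_wide(q, k)
print(score_fp16, score_wide)
```

The fp16 accumulator passes 65504 partway through the sum and stays at inf, while the wider accumulator finishes with a finite score that a later softmax can still handle.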
Expected behavior
I expected no overflows.
Aside
I'm reaching out on behalf of EleutherAI, Lysandre told us to create an issue about this.