A PR is prepared as #24262.
cc @younesbelkada @ArthurZucker
@younesbelkada @ArthurZucker Hi. This is just a friendly reminder.
Hi @kikutakou
For fp16 models it is important to calculate the attention scores in full precision, mainly for numerical stability reasons. Check out for instance https://github.com/huggingface/transformers/issues/17433, or the thread in https://github.com/huggingface/transformers/pull/17437 (which includes authors of the OPT models), to start with. So the computation inside the attention module that calculates `attn_weights` should always stay in full precision.
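For illustration, here is a minimal sketch of that upcast-then-downcast pattern (the helper name, shapes and `norm_factor` are just for this example, not the actual `modeling_gpt_neox.py` code):

```python
import torch

# Minimal sketch: scores are computed and softmax-ed in fp32 even for an
# fp16 model, then cast back to the value dtype before the final matmul.
def attention_scores(query, key, value, norm_factor=1.0):
    # query / key / value: fp16 tensors of shape (batch, heads, seq, head_dim)
    attn_weights = torch.matmul(query.float(), key.float().transpose(-1, -2)) / norm_factor
    attn_weights = torch.nn.functional.softmax(attn_weights, dim=-1)  # fp32 softmax
    attn_weights = attn_weights.to(value.dtype)                       # back to fp16
    return torch.matmul(attn_weights, value)
```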
Regarding the positional embeddings, looking at the official implementation, it seems that the positional embeddings are indeed returned in half precision: https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/positional_embeddings.py#L48. Maybe @StellaAthena can help us confirm whether the rotary embeddings should return fp16 values in half-precision modes.
For RoPE, there was an attempt to fix this here: #23837; it seems that in the original code the embeddings are re-computed on each forward pass, with the correct dtype. It's very detailed!
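As a rough sketch of that idea (illustrative only, not the code from #23837): the cos/sin values can be computed in fp32 inside `forward()` and cast to the dtype of the incoming hidden states before being returned:

```python
import torch

class RotaryEmbedding(torch.nn.Module):
    """Illustrative rotary embedding that re-computes cos/sin each forward."""

    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, x, seq_len):
        # x only supplies the target device and dtype (e.g. torch.float16)
        t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(t, self.inv_freq.to(x.device))
        emb = torch.cat((freqs, freqs), dim=-1)
        # compute in fp32 for accuracy, return in the model dtype
        return emb.cos().to(x.dtype), emb.sin().to(x.dtype)
```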
I have no reason to think that you can't compute the rotary embedding sines in half-precision.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
`transformers` version: 4.30.1

@ArthurZucker and @younesbelkada
Hi. I'm using a model, `GPTNeoXForCausalLM` (defined in `src/transformers/models/gpt_neox/modeling_gpt_neox.py`), with torch.float16 precision by calling `.from_pretrained(torch_dtype=torch.float16)`. In this mode, the model is expected to compute in float16 precision to save GPU memory. However, some of the variables in this model remain float32 rather than being converted to float16, and they affect the subsequent calculation. Eventually the attention weights, which can be a dominant memory consumer, are calculated in float32, so GPU memory is not saved as expected.

The following is the problem in detail:

1. `GPTNeoXForCausalLM` is loaded with `torch_dtype=torch.float16`.
2. `self.cos_cached` and `self.sin_cached` in the `RotaryEmbedding` class held by `GPTNeoXAttention` are calculated as float32 in `__init__()`.
3. `GPTNeoXAttention.forward()` calls `RotaryEmbedding.forward()`.
4. `RotaryEmbedding.forward()` prepares the return values in float32.
5. `GPTNeoXAttention.forward()` receives the return values in float32.
6. `attn_weights` is calculated in float32.
7. `attn_weights = attn_weights.to(value.dtype)` is called and `attn_weights` is converted back to float16.

Because of step 7, the model `forward()` returns float16, but it consumes a float32 GPU memory footprint internally.
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Here is a code sample.
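The original script lives on the branch mentioned under "Expected behavior", so the snippet below is only an illustrative sketch of the kind of check it performs (the checkpoint name and attribute path are assumptions based on transformers 4.30):

```python
import torch
from transformers import GPTNeoXForCausalLM

# Load a GPT-NeoX checkpoint in half precision and inspect the dtypes.
model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", torch_dtype=torch.float16
)
attention = model.gpt_neox.layers[0].attention
print(attention.query_key_value.weight.dtype)  # torch.float16, as expected
print(attention.rotary_emb.cos_cached.dtype)   # torch.float32 in the affected version
print(attention.rotary_emb.sin_cached.dtype)   # torch.float32 in the affected version
```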
Expected behavior
If you execute it on the ko_gptneox_fp16_debug branch, it prints all the dtypes. All dtypes are expected to be float16, but they are actually float32.