huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

GPTNeoXAttention takes extra GPU memory footprint in torch.float16 precision mode. #24261

Closed. kikutakou closed this issue 1 year ago.

kikutakou commented 1 year ago

System Info

@ArthurZucker and @younesbelkada

Hi. I'm using GPTNeoXForCausalLM (defined in src/transformers/models/gpt_neox/modeling_gpt_neox.py) in torch.float16 precision by calling .from_pretrained(torch_dtype=torch.float16). In this mode the model is expected to compute in float16 to save GPU memory. However, some variables in the model remain float32 instead of being cast to float16, and they affect the subsequent computation. As a result the attention weights, which can be the dominant memory consumer, are computed in float32, and GPU memory is not saved as expected.

The problem in detail:

  1. Set up GPTNeoXForCausalLM with torch_dtype=torch.float16.
  2. self.cos_cached and self.sin_cached in the RotaryEmbedding class held by GPTNeoXAttention are computed as float32 in __init__().
  3. GPTNeoXAttention.forward() calls RotaryEmbedding.forward().
  4. RotaryEmbedding.forward() prepares its return values in float32.
  5. GPTNeoXAttention.forward() receives those return values in float32.
  6. From here on, all variables, including attn_weights, are computed in float32 (see the sketch after this list).
  7. attn_weights = attn_weights.to(value.dtype) is called, and attn_weights is cast back to float16.

Because of step 7, forward() returns float16 results, but the model consumes a float32-sized footprint internally.
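To illustrate step 6: the mixed dtypes promote everything to float32 via PyTorch type promotion. Here is a minimal sketch in plain PyTorch (not the transformers code; the shapes are made up, only the dtype behaviour matters):

import torch

query = torch.randn(1, 8, 4, 16, dtype=torch.float16)      # activations in float16, as requested
cos_cached = torch.ones(1, 1, 4, 16, dtype=torch.float32)  # rotary cache left in float32 (step 2)

rotated = query * cos_cached                                # float16 * float32 -> promoted to float32
attn_weights = rotated @ rotated.transpose(-1, -2)          # stays float32 (step 6)
print(rotated.dtype, attn_weights.dtype)                    # torch.float32 torch.float32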

Information

Tasks

Reproduction

  1. Check out the ko_gptneox_fp16_debug branch of https://github.com/kikutakou/transformers (this branch only adds debug print statements on top of origin/main).
  2. Set up the model with GPTNeoXForCausalLM.from_pretrained and torch_dtype=torch.float16.
  3. Call model.forward().

Here is sample code:

import torch
from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

torch.manual_seed(0)

MODEL_NAME = 'cyberagent/gpt-neox-1b-japanese'

# load text
input_text = 'this is test'

# tokenize text
tokenizer = GPTNeoXTokenizerFast.from_pretrained(MODEL_NAME, use_auth_token=True)
t = tokenizer(input_text, return_tensors='pt', truncation=True, padding='longest', add_special_tokens=False)
input_ids = t['input_ids'].cuda()
attention_mask = t['attention_mask'].cuda()
input_len = len(input_ids[0])

model = GPTNeoXForCausalLM.from_pretrained(MODEL_NAME, low_cpu_mem_usage=True,
                                           use_auth_token=True, torch_dtype=torch.float16)

model.eval()
model.cuda()

# generate
generation_len = (input_len + 50)
batch_params = dict(input_ids=input_ids,
                    attention_mask=attention_mask,
                    repetition_penalty=None, num_return_sequences=3, num_beams=1, do_sample=True,
                    temperature=None, top_p=0.95, pad_token_id=1, max_length=generation_len)
output_ids = model.generate(**batch_params).cpu()[0]

# decode
output_ids = output_ids[input_len:]
decoded = tokenizer.decode(output_ids, skip_special_tokens=False)
print(decoded)
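If you prefer not to check out the debug branch, the cached rotary tensors can also be inspected directly after loading the model. A minimal sketch, assuming the attribute names of the modeling_gpt_neox.py version described above (they may differ in other releases):

# the rotary caches live on the attention module of each layer
rotary = model.gpt_neox.layers[0].attention.rotary_emb
print(rotary.cos_cached.dtype, rotary.sin_cached.dtype)  # prints torch.float32 even though torch_dtype=torch.float16 was passed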

Expected behavior

Running the script on the ko_gptneox_fp16_debug branch prints the dtypes of the intermediate tensors. They are all expected to be float16, but they are actually float32.
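The extra footprint itself can also be observed by comparing peak GPU memory with and without a fix. A rough sketch (not part of the debug branch), reusing the model and inputs from the reproduction script above:

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model(input_ids=input_ids, attention_mask=attention_mask)
print(f'peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB')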

kikutakou commented 1 year ago

PR is prepared as #24262

amyeroberts commented 1 year ago

cc @younesbelkada @ArthurZucker

kikutakou commented 1 year ago

@younesbelkada @ArthurZucker Hi. This is just a friendly reminder.

younesbelkada commented 1 year ago

Hi @kikutakou
For fp16 models it is important to compute the attention scores in full precision, mainly for numerical stability reasons. See for instance https://github.com/huggingface/transformers/issues/17433, or the thread in https://github.com/huggingface/transformers/pull/17437 (which includes authors of the OPT models), to start with. So the computation of attn_weights inside the attention module should always stay in full precision.
Regarding the positional embeddings, looking at the official implementation, it seems that the positional embeddings are indeed returned in half precision: https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/positional_embeddings.py#L48. Maybe @StellaAthena can help us confirm whether the rotary embeddings should return fp16 values in half-precision mode.
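For reference, the pattern described above (keeping inputs and outputs in float16 while upcasting only the score/softmax computation) looks roughly like this; an illustrative sketch, not the actual transformers code:

import torch

def attention_with_fp32_scores(query, key, value, scale):
    # query/key/value are float16; scores and softmax run in float32 for stability
    attn_scores = torch.matmul(query.float(), key.float().transpose(-1, -2)) * scale
    # cast the weights back to the value dtype before the final matmul, as in step 7 above
    attn_weights = torch.softmax(attn_scores, dim=-1).to(value.dtype)
    return torch.matmul(attn_weights, value)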

ArthurZucker commented 1 year ago

For RoPE, there was an attempt to fix this in #23837; in the original code the embeddings seem to be re-computed on every forward pass with the correct dtype. It's very detailed!
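A rough sketch of that idea (building cos/sin on each forward pass and casting to the caller's dtype; illustrative only, not the code from #23837):

import torch

class SimpleRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, base=10000):
        super().__init__()
        # keep only the inverse frequencies; cos/sin are built on demand
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq, persistent=False)

    def forward(self, x, seq_len):
        t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)
        freqs = torch.outer(t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        # cast to the caller's dtype so a float16 model gets float16 cos/sin
        return emb.cos().to(x.dtype), emb.sin().to(x.dtype)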

StellaAthena commented 1 year ago

> Maybe @StellaAthena can help us confirm whether the rotary embeddings should return fp16 values in half-precision mode.

I have no reason to think that you can't compute the rotary embedding sines in half precision.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.