huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

GPTNeoXRotaryEmbedding has a defect when using sin/cos cache #25813

Closed: underskies00 closed this issue 11 months ago

underskies00 commented 1 year ago

System Info

Who can help?

Text models: @ArthurZucker and @younesbelkada

Information

Tasks

Reproduction

In transformers/models/gpt_neox/modeling_gpt_neox.py, line 320:

return self.cos_cached[:seq_len, ...].to(x.device), self.sin_cached[:seq_len, ...].to(x.device)

The slice is missing two dimensions before seq_len. This does not cause a bug, because the leading dimension of the cache has size 1 and seq_len is always at least 1, so nothing is ever cut away. But it does mean the cache is never actually sliced: the whole cache is returned every time, which may lead to poor performance during inference.

Expected behavior

Maybe the right code should be:

return self.cos_cached[:, :, :seq_len, ...].to(x.device), self.sin_cached[:, :, :seq_len, ...].to(x.device)
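
To illustrate the difference, here is a minimal, self-contained sketch (the concrete sizes below are made up for illustration; the cached tensors in modeling_gpt_neox.py are registered with two leading singleton dimensions, i.e. shape [1, 1, max_position_embeddings, dim]). Slicing the first dimension with [:seq_len] is effectively a no-op because that dimension has size 1, so the whole cache comes back, while slicing dimension 2 trims it to the first seq_len positions:

import torch

max_position_embeddings, dim, seq_len = 2048, 64, 16
cos_cached = torch.randn(1, 1, max_position_embeddings, dim)

# current slice: dimension 0 has size 1, so [:seq_len] keeps everything
print(cos_cached[:seq_len, ...].shape)        # torch.Size([1, 1, 2048, 64])

# proposed slice: cuts along the sequence dimension as intended
print(cos_cached[:, :, :seq_len, ...].shape)  # torch.Size([1, 1, 16, 64])
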
ydshieh commented 1 year ago

Looks like you are right, as this is also how it is done for the Llama models, where we have

        return (
            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
        )

@ArthurZucker would you like to double-check here?

ArthurZucker commented 1 year ago

Yep indeed. Would you like to open a PR @underskies00 ?

fyi @gante if there's a reason we did not merge this for GPTNeoX maybe?

gante commented 1 year ago

@ArthurZucker looking at the git blame, this has been present since the 1st commit :D Possibly flew under the radar.

Technically harmless (since seq_len is almost always > batch size), but in need of fixing.

ArthurZucker commented 1 year ago

Actually, for a simple case like this:

import time

from transformers import AutoTokenizer, GPTNeoXForCausalLM, AutoConfig

# small randomly initialized GPT-NeoX: same architecture, only 5 hidden layers
config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-20b", num_hidden_layers=5)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = GPTNeoXForCausalLM(config).cuda()

inputs = tokenizer("Hey how are you doing", return_tensors="pt").to("cuda")

# time a 258-new-token generation
start = time.time()
model.generate(**inputs, max_new_tokens=258)
print(time.time() - start)

I already get 2 extra seconds. It's a small model, but the same should apply to big ones as well; I just reduced the number of layers.
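
For anyone who wants to measure the effect of the proposed slice without editing the file, here is a hedged sketch (not a definitive patch): it wraps the current GPTNeoXRotaryEmbedding.forward and re-slices its output along the sequence dimension, assuming the forward(x, seq_len=None) signature and the [1, 1, max_position_embeddings, dim] cache layout discussed above. Apply it before calling model.generate in the timing snippet and compare.

from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXRotaryEmbedding

_original_forward = GPTNeoXRotaryEmbedding.forward

def sliced_forward(self, x, seq_len=None):
    # the current implementation returns the full cached tensors
    cos, sin = _original_forward(self, x, seq_len=seq_len)
    # trim along the sequence axis (dim 2), which the current slice misses
    return cos[:, :, :seq_len, ...], sin[:, :, :seq_len, ...]

GPTNeoXRotaryEmbedding.forward = sliced_forward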

underskies00 commented 1 year ago

> Yep indeed. Would you like to open a PR @underskies00 ?
>
> fyi @gante if there's a reason we did not merge this for GPTNeoX maybe?

It is OK if you handle it; I'm not familiar with how to open a PR. Thanks @ArthurZucker.