huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

GPTNeoXRotaryEmbedding has a defect when using sin/cos cache #25813

Closed: underskies00 closed this issue 11 months ago

underskies00 commented 1 year ago

System Info

Who can help?

Text models: @ArthurZucker and @younesbelkada

Information

Tasks

Reproduction

In transformers/models/gpt_neox/modeling_gpt_neox.py, line 320:

return self.cos_cached[:seq_len, ...].to(x.device), self.sin_cached[:seq_len, ...].to(x.device)

The slice is missing two dimensions before seq_len. This does not cause a bug, because the leading dimension of the cache has size 1 and seq_len is always at least 1, so nothing is ever cut away. But it does mean the cache is never actually sliced: the whole cache is returned every time, which may lead to poor performance during inference.

Expected behavior

Maybe the right code should be:

return self.cos_cached[:, :, :seq_len, ...].to(x.device), self.sin_cached[:, :, :seq_len, ...].to(x.device)
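
To illustrate the difference, here is a minimal, self-contained sketch (the concrete sizes below are made up for illustration; the cached tensors in modeling_gpt_neox.py are registered with two leading singleton dimensions, i.e. shape [1, 1, max_position_embeddings, dim]). Slicing the first dimension with [:seq_len] is effectively a no-op because that dimension has size 1, so the whole cache comes back, while slicing dimension 2 trims it to the first seq_len positions:

import torch

max_position_embeddings, dim, seq_len = 2048, 64, 16
cos_cached = torch.randn(1, 1, max_position_embeddings, dim)

# current slice: dimension 0 has size 1, so [:seq_len] keeps everything
print(cos_cached[:seq_len, ...].shape)        # torch.Size([1, 1, 2048, 64])

# proposed slice: cuts along the sequence dimension as intended
print(cos_cached[:, :, :seq_len, ...].shape)  # torch.Size([1, 1, 16, 64])
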
ydshieh commented 1 year ago

Looks like you are right, as this is also how it is done for the Llama models, where we have

        return (
            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
        )

@ArthurZucker would you like to double-check here?

ArthurZucker commented 1 year ago

Yep indeed. Would you like to open a PR @underskies00 ?

fyi @gante if there's a reason we did not merge this for GPTNeoX maybe?

gante commented 1 year ago

@ArthurZucker looking at the git blame, this has been present since the 1st commit :D Possibly flew under the radar.

Technically harmless (since seq_len is almost always > batch size), but in need of fixing.

ArthurZucker commented 1 year ago

Actually, for a simple case like this:

import time

from transformers import AutoTokenizer, GPTNeoXForCausalLM, AutoConfig

# small randomly initialized GPT-NeoX: same architecture, only 5 hidden layers
config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-20b", num_hidden_layers=5)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = GPTNeoXForCausalLM(config).cuda()

inputs = tokenizer("Hey how are you doing", return_tensors="pt").to("cuda")

# time a 258-new-token generation
start = time.time()
model.generate(**inputs, max_new_tokens=258)
print(time.time() - start)

I already get 2 extra seconds. It's a small model, but the same should apply to big ones as well; I just reduced the number of layers.
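
For anyone who wants to measure the effect of the proposed slice without editing the file, here is a hedged sketch (not a definitive patch): it wraps the current GPTNeoXRotaryEmbedding.forward and re-slices its output along the sequence dimension, assuming the forward(x, seq_len=None) signature and the [1, 1, max_position_embeddings, dim] cache layout discussed above. Apply it before calling model.generate in the timing snippet and compare.

from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXRotaryEmbedding

_original_forward = GPTNeoXRotaryEmbedding.forward

def sliced_forward(self, x, seq_len=None):
    # the current implementation returns the full cached tensors
    cos, sin = _original_forward(self, x, seq_len=seq_len)
    # trim along the sequence axis (dim 2), which the current slice misses
    return cos[:, :, :seq_len, ...], sin[:, :, :seq_len, ...]

GPTNeoXRotaryEmbedding.forward = sliced_forward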

underskies00 commented 1 year ago

> Yep indeed. Would you like to open a PR @underskies00 ?
>
> fyi @gante if there's a reason we did not merge this for GPTNeoX maybe?

It is OK if you handle it; I'm not familiar with how to open a PR. Thanks @ArthurZucker.