Closed: underskies00 closed this issue 11 months ago
Looks like you are right, as this is the case for models in llama, where we have:
return (
self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
)
@ArthurZucker could you double confirm here?
Yep indeed. Would you like to open a PR @underskies00 ?
fyi @gante if there's a reason we did not merge this for GPTNeoX maybe?
@ArthurZucker looking at the git blame, this has been present since the 1st commit :D Possibly flew under the radar.
Technically harmless (since seq_len is almost always > batch size), but in need of fixing.
Actually, for a simple case like this:
import time

from transformers import AutoConfig, AutoTokenizer, GPTNeoXForCausalLM

config = AutoConfig.from_pretrained("EleutherAI/gpt-neox-20b", num_hidden_layers=5)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = GPTNeoXForCausalLM(config).cuda()
inputs = tokenizer("Hey how are you doing", return_tensors="pt").to("cuda")

start = time.time()
model.generate(**inputs, max_new_tokens=258)
print(time.time() - start)
I already get 2 more seconds. It's a small model, but this should apply to big ones as well; I just reduced the number of layers.
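A minimal sketch of why the short slice is a no-op, assuming the cached cos/sin buffers are shaped [1, 1, max_seq_len, dim] as in this version of the code:

```python
import torch

# Hypothetical stand-in for the rotary cache; assumed shape [1, 1, max_seq_len, dim].
cos_cached = torch.randn(1, 1, 2048, 64)
seq_len = 7

# Slicing only the first axis (size 1) returns the whole cache unchanged:
print(cos_cached[:seq_len, ...].shape)        # torch.Size([1, 1, 2048, 64])

# Slicing the third (sequence) axis actually truncates to the current length:
print(cos_cached[:, :, :seq_len, ...].shape)  # torch.Size([1, 1, 7, 64])
```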
It's OK if you handle it; I'm not familiar with how to open a PR. Thanks @ArthurZucker.
System Info

transformers version: 4.31.0

Who can help?

Text models: @ArthurZucker and @younesbelkada

Information

Tasks

examples folder (such as GLUE/SQuAD, ...)

Reproduction
In transformers/models/gpt_neox/modeling_gpt_neox.py, line 320: the slice lacks 2 dimensions before seq_len. This will not cause a bug, because seq_len is always larger than 1, but it is a defective use of the cache, since the whole cache is returned every time, which may lead to poor performance during inference.
Expected behavior
Maybe the right code should be:
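That is, slice the sequence axis the same way the llama snippet above does. A sketch of the change (the current line is reconstructed from the description above, not copied verbatim from the 4.31.0 source):

```python
# In GPTNeoXRotaryEmbedding.forward (sketch; cos_cached/sin_cached assumed to be
# shaped [1, 1, max_seq_len, dim]).

# Current slicing only indexes the first (size-1) axis, so nothing is truncated:
#   return (
#       self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
#       self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
#   )

# Proposed slicing, matching the llama implementation quoted earlier:
return (
    self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
    self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
)
```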