tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

Bigcode architecture #21

Closed selimsandal closed 9 months ago

selimsandal commented 9 months ago

NotImplementedError: attention_sinks does not support models with the gpt_bigcode architecture at this time.

I got this error while trying to run WizardLM/WizardCoder-3B-V1.0. Is it possible to add support for the bigcode architecture?

tomaarsen commented 9 months ago

Hello!

I looked into this, and it seems that GPTBigCode uses "vanilla" position embeddings, i.e. it passes the position IDs through an nn.Embedding with max_position_embeddings entries and then adds the resulting position embeddings to the input embeddings: https://github.com/huggingface/transformers/blob/093848d3ccf3884caf048718b6bae833da0edb94/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py#L633-L634
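
For anyone curious, here's a minimal sketch of that kind of learned absolute position embedding; the vocabulary and hidden sizes are placeholders, not the actual GPTBigCode modules:

```python
import torch
import torch.nn as nn

# Illustrative sketch, not the real GPTBigCode code: learned absolute position
# embeddings are just a lookup table with max_position_embeddings rows.
max_position_embeddings = 8192
hidden_size = 2048

wte = nn.Embedding(50257, hidden_size)                    # token embeddings
wpe = nn.Embedding(max_position_embeddings, hidden_size)  # position embeddings

input_ids = torch.randint(0, 50257, (1, 16))
position_ids = torch.arange(16).unsqueeze(0)

# Positions are baked into the hidden states before any attention layer runs.
hidden_states = wte(input_ids) + wpe(position_ids)

# A position id at or beyond max_position_embeddings has no trained row:
# wpe(torch.tensor([[8192]]))  # -> IndexError: index out of range
```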

This approach mirrors that of gpt2, for which I haven't figured out a way to extend the fluency either. In short, any token position at or beyond max_position_embeddings simply breaks, as there is no trained position embedding for it. In GPTBigCode, the n_positions parameter is mapped to max_position_embeddings, so the WizardLM/WizardCoder model can handle at most 8192 positions: https://github.com/huggingface/transformers/blob/093848d3ccf3884caf048718b6bae833da0edb94/src/transformers/models/gpt_bigcode/configuration_gpt_bigcode.py#L96
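
A quick way to check that limit is to read the model's config from the Hub (the printed values assume this checkpoint's n_positions is indeed 8192):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("WizardLM/WizardCoder-3B-V1.0")
print(config.n_positions)              # 8192 learned position embeddings
print(config.max_position_embeddings)  # 8192, aliased to n_positions via attribute_map
```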

It would be interesting to see whether attention sinks can be added to models with vanilla position embeddings, but I don't have the time to investigate, I'm afraid. My theory: yes, it's possible, but it would require recomputing the entire sequence for every newly generated token, because the whole key/value cache is invalidated whenever the old position embeddings have to be replaced with shifted ones.
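
To make that concrete, something like the following hypothetical loop would be needed; the function and its defaults are purely illustrative and not part of attention_sinks:

```python
import torch

def generate_with_window(model, input_ids, num_sinks=4, window_size=1020, max_new_tokens=64):
    """Hypothetical sketch: slide a window over absolute positions by
    recomputing the whole kept sequence at every generation step."""
    for _ in range(max_new_tokens):
        if input_ids.shape[1] > num_sinks + window_size:
            # Keep the "sink" tokens plus the most recent window.
            kept = torch.cat([input_ids[:, :num_sinks], input_ids[:, -window_size:]], dim=-1)
        else:
            kept = input_ids
        # The kept tokens are re-assigned positions 0..len(kept)-1, which differ from
        # the positions their old keys/values were computed with, so the KV cache
        # cannot be reused: the full forward pass has to run every step.
        logits = model(kept, use_cache=False).logits
        next_token = logits[:, -1:].argmax(dim=-1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```

With rotary embeddings the cached keys can be re-rotated to their shifted positions, which is what makes the windowed cache cheap; with learned absolute embeddings added before attention, there is no equivalent fix-up, hence the full recompute above.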

I'm afraid I'll have to close this for now.