Closed selimsandal closed 9 months ago
Hello!
I looked into this, and it seems that GPT_BigCode uses "vanilla" position embeddings, i.e. passing position IDs through an `nn.Embedding` with `max_position_embeddings` entries, and then adding the resulting position embeddings to the input embeddings: https://github.com/huggingface/transformers/blob/093848d3ccf3884caf048718b6bae833da0edb94/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py#L633-L634
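For reference, here is a minimal sketch of that "vanilla" scheme (the variable names are illustrative, not the actual `transformers` internals):

```python
import torch
import torch.nn as nn

# Tiny, hypothetical sizes for illustration only.
vocab_size = 16
hidden_size = 4
max_position_embeddings = 8

wte = nn.Embedding(vocab_size, hidden_size)               # token embedding table
wpe = nn.Embedding(max_position_embeddings, hidden_size)  # learned position embedding table

input_ids = torch.tensor([[3, 7, 1]])                     # (batch, seq_len)
position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)

# Position embeddings are added element-wise to the token embeddings.
hidden_states = wte(input_ids) + wpe(position_ids)
print(hidden_states.shape)  # torch.Size([1, 3, 4])
```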
This approach mirrors that of GPT-2, for which I haven't figured out a way to extend the fluency. In short, any token position higher than `max_position_embeddings` simply breaks, as there's no trained position embedding for it. In GPTBigCode, the `n_positions` parameter is mapped to `max_position_embeddings`, so the WizardLM/WizardCoder model can only handle 8192 positions and no more: https://github.com/huggingface/transformers/blob/093848d3ccf3884caf048718b6bae833da0edb94/src/transformers/models/gpt_bigcode/configuration_gpt_bigcode.py#L96
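You can see the hard limit directly: indexing one row past the end of the learned table raises an error, so there is simply nothing to look up for position 8192 and beyond. A small sketch (the 8192-entry table mirrors WizardCoder's `n_positions`; the rest is illustrative):

```python
import torch
import torch.nn as nn

# A learned position table with 8192 entries, matching
# WizardCoder's n_positions -> max_position_embeddings.
wpe = nn.Embedding(8192, 4)

ok = wpe(torch.tensor([8191]))  # last trained position: works fine
print(ok.shape)                 # torch.Size([1, 4])

try:
    wpe(torch.tensor([8192]))   # one position past the table: no trained row exists
except IndexError as e:
    print("IndexError:", e)
```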
It would be interesting to see if attention sinks can be added to models with vanilla position embeddings, but I'm afraid I don't have the time needed to investigate. My theory: yes, it's possible, but it would require recomputing the entire sequence for every new token generated, because the entire key/value cache is invalidated whenever the old position embeddings must be replaced with shifted ones.
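The cache-invalidation concern can be demonstrated in miniature: with learned absolute position embeddings, the key computed for a token depends on its position, so shifting a token to a new position (as the attention-sink rolling window does) yields a different key than the cached one. A hypothetical sketch with randomly initialized weights:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 4

wpe = nn.Embedding(16, hidden_size)                  # learned absolute position table
k_proj = nn.Linear(hidden_size, hidden_size, bias=False)  # key projection

token_embed = torch.randn(hidden_size)               # one fixed token embedding

# Key for the same token at its original position 5 vs. shifted to position 4.
k_at_5 = k_proj(token_embed + wpe(torch.tensor(5)))
k_at_4 = k_proj(token_embed + wpe(torch.tensor(4)))

# The keys differ, so a key cached at position 5 cannot be reused once
# positions shift -- the whole cache would have to be recomputed.
print(torch.allclose(k_at_5, k_at_4))  # False
```

With rotary embeddings (as in Llama-style models) the shift can instead be applied to the cached keys directly, which is why attention sinks are easier to support there.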
I'm afraid I'll have to close this for now.
NotImplementedError: `attention_sinks` does not support models with the `gpt_bigcode` architecture at this time.

While trying to run WizardLM/WizardCoder-3B-V1.0 I got this error. Is it possible to add bigcode arch support?