tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

Bigcode architecture #21

Closed selimsandal closed 9 months ago

selimsandal commented 9 months ago

NotImplementedError: attention_sinks does not support models with the gpt_bigcode architecture at this time.

I got this error while trying to run WizardLM/WizardCoder-3B-V1.0. Is it possible to add support for the bigcode architecture?

tomaarsen commented 9 months ago

Hello!

I looked into this, and it seems that GPTBigCode uses "vanilla" position embeddings, i.e. it passes the position IDs through an nn.Embedding with max_position_embeddings entries and then adds the resulting position embeddings to the input embeddings: https://github.com/huggingface/transformers/blob/093848d3ccf3884caf048718b6bae833da0edb94/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py#L633-L634
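
For anyone curious, here's a minimal sketch of that kind of learned absolute position embedding; the vocabulary and hidden sizes are placeholders, not the actual GPTBigCode modules:

```python
import torch
import torch.nn as nn

# Illustrative sketch, not the real GPTBigCode code: learned absolute position
# embeddings are just a lookup table with max_position_embeddings rows.
max_position_embeddings = 8192
hidden_size = 2048

wte = nn.Embedding(50257, hidden_size)                    # token embeddings
wpe = nn.Embedding(max_position_embeddings, hidden_size)  # position embeddings

input_ids = torch.randint(0, 50257, (1, 16))
position_ids = torch.arange(16).unsqueeze(0)

# Positions are baked into the hidden states before any attention layer runs.
hidden_states = wte(input_ids) + wpe(position_ids)

# A position id at or beyond max_position_embeddings has no trained row:
# wpe(torch.tensor([[8192]]))  # -> IndexError: index out of range
```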

This approach mirrors that of gpt2, for which I haven't figured out a way to extend the fluency either. In short, any token position at or beyond max_position_embeddings simply breaks, as there is no trained position embedding for it. In GPTBigCode, the n_positions parameter is mapped to max_position_embeddings, so the WizardLM/WizardCoder model can handle at most 8192 positions: https://github.com/huggingface/transformers/blob/093848d3ccf3884caf048718b6bae833da0edb94/src/transformers/models/gpt_bigcode/configuration_gpt_bigcode.py#L96
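
A quick way to check that limit is to read the model's config from the Hub (the printed values assume this checkpoint's n_positions is indeed 8192):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("WizardLM/WizardCoder-3B-V1.0")
print(config.n_positions)              # 8192 learned position embeddings
print(config.max_position_embeddings)  # 8192, aliased to n_positions via attribute_map
```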

It would be interesting to see whether attention sinks can be added to models with vanilla position embeddings, but I don't have the time to investigate, I'm afraid. My theory: yes, it's possible, but it would require recomputing the entire sequence for every newly generated token, because the whole key/value cache is invalidated whenever the old position embeddings have to be replaced with shifted ones.
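
To make that concrete, something like the following hypothetical loop would be needed; the function and its defaults are purely illustrative and not part of attention_sinks:

```python
import torch

def generate_with_window(model, input_ids, num_sinks=4, window_size=1020, max_new_tokens=64):
    """Hypothetical sketch: slide a window over absolute positions by
    recomputing the whole kept sequence at every generation step."""
    for _ in range(max_new_tokens):
        if input_ids.shape[1] > num_sinks + window_size:
            # Keep the "sink" tokens plus the most recent window.
            kept = torch.cat([input_ids[:, :num_sinks], input_ids[:, -window_size:]], dim=-1)
        else:
            kept = input_ids
        # The kept tokens are re-assigned positions 0..len(kept)-1, which differ from
        # the positions their old keys/values were computed with, so the KV cache
        # cannot be reused: the full forward pass has to run every step.
        logits = model(kept, use_cache=False).logits
        next_token = logits[:, -1:].argmax(dim=-1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```

With rotary embeddings the cached keys can be re-rotated to their shifted positions, which is what makes the windowed cache cheap; with learned absolute embeddings added before attention, there is no equivalent fix-up, hence the full recompute above.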

I'm afraid I'll have to close this for now.