tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

Add Yi support + benchmark results #27

Closed. MekkCyber closed this 8 months ago

MekkCyber commented 9 months ago

I noticed that there is no implementation of mpt_pos_shift_attention_forward. I know it isn't strictly necessary, since no changes are needed when the model has no positional encoding, but for consistency I think it's better to have it. Feel free to accept this pull request or not :). I will try working on adding other models to the library. Thank you for your time.
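
For illustration only, a minimal sketch of what such a consistency stub could look like, assuming it simply delegates to the unmodified attention forward (this is not the library's actual code; the function wiring is hypothetical):

# Hypothetical sketch: MPT uses ALiBi biases rather than rotary position
# embeddings, so evicting tokens from the sliding window does not require
# re-rotating any cached keys; the stub can delegate to the original forward.
from transformers.models.mpt.modeling_mpt import MptAttention

def mpt_pos_shift_attention_forward(self, *args, **kwargs):
    # No positional shift needed: reuse the model's own attention forward.
    return MptAttention.forward(self, *args, **kwargs)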

MekkCyber commented 9 months ago

Hello @tomaarsen

Do you have any suggestions for models to implement attention_sinks for?

tomaarsen commented 9 months ago

Perhaps the very recent Yi models?

MekkCyber commented 9 months ago

I tried to add Yi support. I think the Yi tokenizer is not yet integrated into AutoTokenizer, so to test it I used the code provided for YiTokenizer, with tokenizer.model as the vocab_file. If you have any remarks, please let me know.


import torch
from attention_sinks import AutoModelForCausalLM
# YiTokenizer is taken from the Yi model repository's tokenizer code
# (assumed here to be available locally as tokenization_yi.py).
from tokenization_yi import YiTokenizer

model_id = "01-ai/Yi-6B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # for efficiency:
    device_map="auto",
    torch_dtype=torch.float16,
    # `attention_sinks`-specific arguments:
    attention_sink_size=4,
    attention_sink_window_size=252,  # <- Low for the sake of faster generation
    trust_remote_code=True,
)
model.eval()
tokenizer = YiTokenizer("tokenizer.model")  # tokenizer.model used as the vocab file
tokenizer.pad_token_id = tokenizer.eos_token_id
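
For reference, a quick generation check could look like the sketch below; the prompt and generation settings are illustrative, not part of this PR:

# Illustrative usage check (prompt and settings are made up for this example).
prompt = "The attention sink mechanism allows language models to"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
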
tomaarsen commented 8 months ago

Hello!

Apologies for the delay on this. Regarding the tokenizer, I think that is because AutoTokenizer also requires trust_remote_code=True, e.g.:

import torch
from attention_sinks import AutoModelForCausalLM
from transformers import AutoTokenizer

model_id = "01-ai/Yi-6B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # for efficiency:
    device_map="auto",
    torch_dtype=torch.float16,
    # `attention_sinks`-specific arguments:
    attention_sink_size=4,
    attention_sink_window_size=252,  # <- Low for the sake of faster generation
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token_id = tokenizer.eos_token_id

And then it should be fine!

I added some experiments, ran them, and put the results in the README. I also credited you for this addition there!