tomaarsen / attention_sinks

Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
https://huggingface.co/blog/tomaarsen/attention-sinks
Apache License 2.0

Support newer versions of mistral (e.g. mistralai/Mistral-7B-Instruct-v0.2)? #41

Open · spring1915 opened this issue 6 months ago

tomaarsen commented 6 months ago

Hello!

mistralai/Mistral-7B-Instruct-v0.2 should be supported in the same way that Mistral-7B-v0.1 is :)

Also, consider using the new Attention Sinks implementation in transformers directly: the SinkCache class. See how to use it here: https://colab.research.google.com/drive/1S0oIPaqxAVp0oWEwTadhZXDjhWiTyF12?usp=sharing

You should be able to replace HuggingFaceH4/zephyr-7b-beta with mistralai/Mistral-7B-Instruct-v0.2.
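For reference, a minimal sketch of that swap with SinkCache (assuming transformers >= 4.36, where SinkCache was introduced; the prompt and generation settings here are illustrative, and the exact generate() integration may differ slightly across versions):

```python
# Minimal sketch: constant-memory generation with SinkCache.
# Assumes transformers >= 4.36; prompt and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Tell me a long story.", return_tensors="pt").to(model.device)

# Keep 4 attention-sink tokens plus a sliding window of recent tokens,
# so KV-cache memory stays constant no matter how long generation runs.
past_key_values = SinkCache(window_length=1024, num_sink_tokens=4)

output = model.generate(
    **inputs,
    max_new_tokens=256,
    past_key_values=past_key_values,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```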

spring1915 commented 6 months ago

Great! Thanks @tomaarsen for sharing.

I have another question, since you're an expert in the field. I used the standard streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True) in model.generate() for llama2-7b in Colab. Streaming worked well, but generation often stopped partway through, or when I ran a second request. I also ran into this issue when running inference on AWS large instances (ml.g5.48x) with DeepSpeed. Can you give me a hint about the causes? I googled but haven't found a satisfactory answer.
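For context, a minimal sketch of the streaming setup described above (the model ID, prompt, and token limit are illustrative, not from the original post):

```python
# Sketch of streaming generation with TextStreamer; model ID, prompt,
# and max_new_tokens are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("Explain attention sinks.", return_tensors="pt").to(model.device)

# Tokens are printed to stdout as they are generated. Note that if no
# length limit is set, generation stops at the generation config's
# default (max_length is 20 unless overridden), which can look like
# streaming halting early.
model.generate(**inputs, streamer=streamer, max_new_tokens=256)
```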