spring1915 opened 6 months ago
Great! Thanks @tomaarsen for sharing.

I have another question, as you're an expert in the field. I used the standard `streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)` in `model.generate()` for llama2-7b in Colab, and the streaming worked well, but it often stopped mid-generation or when I ran a second request. I also encountered this issue when running inference on AWS large instances (ml.g5.48x) with DeepSpeed. Can you give me a hint about the causes? I googled but haven't found a satisfactory answer.
Hello!

`mistralai/Mistral-7B-Instruct-v0.2` should be supported in the same way that `Mistral-7B-v0.1` is :)

Also, consider using the new Attention Sinks implementation in `transformers` directly: the `SinkCache`. See how to use it here: https://colab.research.google.com/drive/1S0oIPaqxAVp0oWEwTadhZXDjhWiTyF12?usp=sharing

You should be able to replace `HuggingFaceH4/zephyr-7b-beta` with `mistralai/Mistral-7B-Instruct-v0.2`.