turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

How do I extend the context with Llama 2? #215

Closed: ShahZ181 closed this issue 1 year ago

ShahZ181 commented 1 year ago

How do I extend the context with Llama 2? I tried changing the alpha values and compress_pos_emb, but the model seems to blurt out nonsense after 4096 tokens. Is there another way to do it? The same method works for LLaMA 1 but not LLaMA 2.

EyeDeck commented 1 year ago

There's no significant difference between context-extension techniques for LLaMA 1 and 2, except that LLaMA 1's native max length is 2048 while LLaMA 2's is 4096. If you want to double LLaMA 1 up to 4096, you need something like --alpha 2.4 or 2.6 (not sure of the exact threshold, but it's a little over 2); the same applies to doubling LLaMA 2 to 8192.
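For context on what --alpha does: it applies NTK-aware RoPE scaling, which raises the rotary embedding base so the rotary wavelengths stretch to cover longer sequences. Here is a minimal sketch of the idea; the function and parameter names are illustrative, not ExLlama's actual internals:

```python
import torch

def rope_inv_freq(head_dim: int, alpha: float = 1.0, base: float = 10000.0) -> torch.Tensor:
    # NTK-aware scaling: raising the RoPE base by alpha^(d / (d - 2))
    # stretches the rotary wavelengths so positions beyond the trained
    # length still land in a familiar angular range.
    scaled_base = base * alpha ** (head_dim / (head_dim - 2))
    return 1.0 / (scaled_base ** (torch.arange(0, head_dim, 2).float() / head_dim))

# e.g. doubling LLaMA 2's 4096 context with something like --alpha 2.6
# (head_dim=128 is the LLaMA attention head dimension)
inv_freq = rope_inv_freq(head_dim=128, alpha=2.6)
```

Since only the base changes, the low-frequency components stretch by roughly the full alpha while the higher-frequency ones stretch less, which is part of why the alpha needed to double the usable length lands a bit above 2 rather than exactly 2.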

If you want to use --compress_pos_emb, you need a model finetuned for it. Here are some ExLlama-compatible LLaMA 2 quants that you can try:

- [LLongMA 2 7B]: --compress_pos_emb 2, length up to 8192
- [LLongMA 2 13B]: --compress_pos_emb 2, length up to 8192
- [LLongMA 2 13B 16k]: --compress_pos_emb 4, length up to 16384
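For what it's worth, --compress_pos_emb corresponds to linear position interpolation (the SuperHOT-style technique): position indices are divided by the compression factor before the rotary angles are computed, so the model never sees position values beyond the range it was finetuned on. A rough sketch, with illustrative names:

```python
import torch

head_dim = 128  # LLaMA attention head dimension
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rope_angles(position_ids: torch.Tensor, compress_pos_emb: float = 1.0) -> torch.Tensor:
    # Linear interpolation: dividing position indices by the compression
    # factor squeezes e.g. positions 0..8191 into the 0..4095 range the
    # finetuned model expects, at the cost of coarser position resolution.
    scaled = position_ids.float() / compress_pos_emb
    return torch.outer(scaled, inv_freq)

# e.g. running LLongMA 2 at 8192 tokens with --compress_pos_emb 2
angles = rope_angles(torch.arange(8192), compress_pos_emb=2.0)
```

This is why the factor must match the finetune exactly: a base model run with compression, or a compressed finetune run without it, sees positions on the wrong scale and degrades.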

turboderp commented 1 year ago

Closing this. Feel free to reopen if necessary.