turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Best alpha/cpe values while using extended context length on Exllama? #185

Closed: nikshepsvn closed this issue 1 year ago

nikshepsvn commented 1 year ago

Curious what values work best for y'all when using SuperHOT and non-SuperHOT models with exllama. Trying an 8K SuperHOT model with cpe 4 works great, but for some reason it starts losing coherence after a while when using alpha 4. Also, how do alpha and cpe interact with each other? Is ppl the best way to optimize these parameters?

Tbh just some discussion on what y'all have found works best (would love info on 8k/16k contexts)

EyeDeck commented 1 year ago

For linear scaled models, use whatever it was finetuned with. So, SuperHOT 8k should be cpe 4, SuperHOT 16k should be cpe 8, and so on; usually [finetune length] / [base model length] = cpe. Perplexity will test lower if you use a lower cpe, but in practice it doesn't work. Quick and reliable example: try asking SuperHOT 8k what year something occurred with both values. cpe 4 will probably get it right if the model knows the answer, while cpe 2 will almost always screw it up.
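To make the cpe arithmetic concrete, here's a rough sketch of what linear scaling (position interpolation) does to RoPE. This is illustrative only, not ExLlama's actual internals; the function name is made up:

```python
def linear_scaled_rope_angles(position: int, head_dim: int, cpe: float, base: float = 10000.0):
    """RoPE angles with linear position interpolation: positions are divided by cpe,
    so a finetune at 8192 tokens with cpe = 8192 / 2048 = 4 still sees positions
    inside the base model's original 0..2048 range."""
    scaled_pos = position / cpe
    return [scaled_pos / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

# Picking cpe for a linear-scaled finetune: [finetune length] / [base model length]
cpe = 8192 / 2048  # SuperHOT 8k on a 2048-token base model -> cpe 4
```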

NTK scaling is more complicated. Unlike linear scaling, it does not need a finetune to work. There are several different variants of NTK scaling now that all work a little differently.

"NTKv1" (--alpha), which ExLlama currently uses, is kinda hard to predict. A ppl test is pretty reliable if you know your test is actually calculating ppl near the end of the context window and not on shorter sequences. For example, the ppl benchmark built into ExLlama tests an average sequence length of around 150 tokens, which is obviously nowhere near the point where the model starts losing it if max length is set too high. Otherwise, if you're using it on a regular base-length model, you can just work out empirically where the model starts freaking out and then dial alpha up a bit, or context down a bit. Do note that alpha is a float, so you can pass in, like, 2.5 or whatever.

Then there are NTK-scaled finetunes too, and those get even more complicated. You should probably consult whoever did the finetune for those.
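As a rough sketch of how "NTKv1" alpha scaling differs from linear scaling: instead of shrinking the positions, it enlarges the RoPE base. The dim/(dim-2) exponent below is the commonly cited NTKv1 form; treat it as an illustration under that assumption rather than a statement of exactly what ExLlama computes:

```python
def ntk_scaled_base(alpha: float, head_dim: int = 128, base: float = 10000.0) -> float:
    """'NTKv1'-style scaling: leave positions alone and stretch the RoPE base instead,
    which slows the low-frequency rotations that carry long-range position information."""
    return base * alpha ** (head_dim / (head_dim - 2))

# alpha is a float, so fractional values like 2.5 work fine
for alpha in (1.0, 2.0, 2.5, 4.0):
    print(alpha, round(ntk_scaled_base(alpha), 1))
```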

Oh, and you probably shouldn't mix both --compress_pos_emb and --alpha.
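To see why mixing them is risky, here's a hedged illustration combining the two sketches above (again, not ExLlama's code): applying both at once rescales the rotary angles twice, so the result isn't cleanly either scheme.

```python
def rope_angle(position: float, i: int, head_dim: int, base: float) -> float:
    """Angle of the i-th rotary pair at a given position."""
    return position / (base ** (2 * i / head_dim))

head_dim, base, pos, i = 128, 10000.0, 4096, 32
ntk_base = base * 4 ** (head_dim / (head_dim - 2))  # alpha = 4, same exponent as above

plain  = rope_angle(pos, i, head_dim, base)
linear = rope_angle(pos / 4, i, head_dim, base)      # --compress_pos_emb 4 only
ntk    = rope_angle(pos, i, head_dim, ntk_base)      # --alpha 4 only
both   = rope_angle(pos / 4, i, head_dim, ntk_base)  # mixed: angles are squeezed twice over

print(plain, linear, ntk, both)
```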