turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Best alpha/cpe values while using extended context length on Exllama? #185

Closed: nikshepsvn closed this issue 1 year ago

nikshepsvn commented 1 year ago

Curious what values work best for y'all when using SuperHOT and non-SuperHOT models with exllama. Trying an 8K SuperHOT model with cpe 4 works great, but for some reason it starts losing coherence after a while when using alpha 4. Also, how do alpha and cpe interact with each other? Is ppl the best way to optimize these parameters?

Tbh just some discussion on what y'all have found works best (would love info on 8k/16k contexts)

EyeDeck commented 1 year ago

For linear scaled models, use whatever it was finetuned with. So, SuperHOT 8k should be cpe 4, SuperHOT 16k should be cpe 8, and so on; usually [finetune length] / [base model length] = cpe. Perplexity will test lower if you use a lower cpe, but in practice it doesn't work. Quick and reliable example: try asking SuperHOT 8k what year something occurred with both values. cpe 4 will probably get it right if the model knows the answer, while cpe 2 will almost always screw it up.
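To make the cpe arithmetic concrete, here's a rough sketch of what linear scaling (position interpolation) does to RoPE. This is illustrative only, not ExLlama's actual internals; the function name is made up:

```python
def linear_scaled_rope_angles(position: int, head_dim: int, cpe: float, base: float = 10000.0):
    """RoPE angles with linear position interpolation: positions are divided by cpe,
    so a finetune at 8192 tokens with cpe = 8192 / 2048 = 4 still sees positions
    inside the base model's original 0..2048 range."""
    scaled_pos = position / cpe
    return [scaled_pos / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]

# Picking cpe for a linear-scaled finetune: [finetune length] / [base model length]
cpe = 8192 / 2048  # SuperHOT 8k on a 2048-token base model -> cpe 4
```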

NTK scaling is more complicated. Unlike linear scaling, it does not need a finetune to work. There are several different variants of NTK scaling now that all work a little differently.

"NTKv1" (--alpha), which ExLlama currently uses, is kinda hard to predict. A ppl test is pretty reliable if you know your test is actually calculating ppl near the end of the context window and not on shorter sequences. For example, the ppl benchmark built into ExLlama tests an average sequence length of around 150 tokens, which is obviously nowhere near the point where the model starts losing it if max length is set too high. Otherwise, if you're using it on a regular base-length model, you can just work out empirically where the model starts freaking out and then dial alpha up a bit, or context down a bit. Do note that alpha is a float, so you can pass in, like, 2.5 or whatever.

Then there are NTK-scaled finetunes too, and those get even more complicated. You should probably consult whoever did the finetune for those.
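As a rough sketch of how "NTKv1" alpha scaling differs from linear scaling: instead of shrinking the positions, it enlarges the RoPE base. The dim/(dim-2) exponent below is the commonly cited NTKv1 form; treat it as an illustration under that assumption rather than a statement of exactly what ExLlama computes:

```python
def ntk_scaled_base(alpha: float, head_dim: int = 128, base: float = 10000.0) -> float:
    """'NTKv1'-style scaling: leave positions alone and stretch the RoPE base instead,
    which slows the low-frequency rotations that carry long-range position information."""
    return base * alpha ** (head_dim / (head_dim - 2))

# alpha is a float, so fractional values like 2.5 work fine
for alpha in (1.0, 2.0, 2.5, 4.0):
    print(alpha, round(ntk_scaled_base(alpha), 1))
```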

Oh, and you probably shouldn't mix both --compress_pos_emb and --alpha.
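To see why mixing them is risky, here's a hedged illustration combining the two sketches above (again, not ExLlama's code): applying both at once rescales the rotary angles twice, so the result isn't cleanly either scheme.

```python
def rope_angle(position: float, i: int, head_dim: int, base: float) -> float:
    """Angle of the i-th rotary pair at a given position."""
    return position / (base ** (2 * i / head_dim))

head_dim, base, pos, i = 128, 10000.0, 4096, 32
ntk_base = base * 4 ** (head_dim / (head_dim - 2))  # alpha = 4, same exponent as above

plain  = rope_angle(pos, i, head_dim, base)
linear = rope_angle(pos / 4, i, head_dim, base)      # --compress_pos_emb 4 only
ntk    = rope_angle(pos, i, head_dim, ntk_base)      # --alpha 4 only
both   = rope_angle(pos / 4, i, head_dim, ntk_base)  # mixed: angles are squeezed twice over

print(plain, linear, ntk, both)
```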