turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

test_inference.py PPL evaluation for >4K context #68

Closed: grimulkan closed this issue 7 months ago

grimulkan commented 1 year ago

Anyone else see this issue when using test_inference.py to compute perplexity for any Llama-2 based model that uses rope_scale > 1.0 and context > 4096?

This happens with GPTQ, EXL2, and fp16 70B models alike (perplexity > 1000). Sometimes I get decent PPL below 4K context, but sometimes even that seems suspiciously high.

However, if I look at the EXL2 quantization logs, both the measurement and final perplexities seem reasonable (a 3.x number at max context length). If I use Exllamav2_hf in oobabooga to measure perplexity in the web UI for those same models, it works fine and I get similar 3.x numbers at max context length.

It would be nice to have a stand-alone way to measure perplexity in test_inference.py.
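For reference, here is a minimal sketch of a standalone perplexity calculation that can be used as a sanity check against test_inference.py's numbers. The `forward_logits` callable is a hypothetical placeholder for whatever forward pass you run (ExLlamaV2, HF transformers, etc.); the chunking and loss math are plain PyTorch and are not taken from the library's own evaluation code.

```python
# Minimal standalone perplexity sketch (assumption: not exllamav2's actual eval code).
# forward_logits is a hypothetical stand-in: it takes (1, n) token ids and returns
# (1, n, vocab_size) logits from whatever model backend you are testing.
import torch
import torch.nn.functional as F

def perplexity(token_ids: torch.Tensor, forward_logits, window: int = 4096) -> float:
    """token_ids: LongTensor of shape (1, seq_len). Returns exp(mean next-token NLL)."""
    nll_sum, count = 0.0, 0
    for start in range(0, token_ids.shape[1] - 1, window):
        chunk = token_ids[:, start : start + window + 1]   # one extra token as the final target
        logits = forward_logits(chunk[:, :-1])              # (1, chunk_len - 1, vocab_size)
        targets = chunk[:, 1:]                               # next-token targets
        nll = F.cross_entropy(
            logits.reshape(-1, logits.shape[-1]).float(),
            targets.reshape(-1),
            reduction="sum",
        )
        nll_sum += nll.item()
        count += targets.numel()
    return float(torch.exp(torch.tensor(nll_sum / count)))
```

This uses the same naive non-overlapping windowing that most evaluation scripts use, so if the model and rope scaling are configured correctly it should land in the same 3.x ballpark as the quantization logs and the web UI.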

grimulkan commented 7 months ago

Just wanted to add, since I made this post last year: I'm pretty sure I've since evaluated several long-context models at reasonable PPL, so this was possibly fixed at some point.