Anyone else seeing this issue when using test_inference.py to compute perplexity for a Llama-2-based model that uses rope_scale > 1.0 and context > 4096?
It happens with GPTQ, EXL2 and fp16 70B models alike (perplexity > 1000). Below 4K context I sometimes get a decent PPL, but even that sometimes looks suspiciously high.
However, if I look at the EXL2 quantization logs, both the measurement and final perplexities look reasonable (a 3.x number at max context length).
If I use Exllamav2_hf in oobabooga to measure perplexity in the Web UI for those same models, it works fine and I get similar 3.x numbers at max context length.
It would be nice to have a stand-alone way to measure perplexity in test_inference.py.
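In the meantime, here is a minimal sketch of the standard chunked perplexity calculation written against the plain Hugging Face transformers API rather than test_inference.py. The model path, eval text file, context length and RoPE factor below are placeholders I picked for illustration, not values from this issue:

```python
# Minimal standalone perplexity sketch using the Hugging Face transformers API.
# MODEL_PATH, the eval text file, CTX_LEN and ROPE_FACTOR are placeholders --
# adjust them to your own setup.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "meta-llama/Llama-2-7b-hf"   # placeholder model
CTX_LEN = 8192                            # evaluation context length
ROPE_FACTOR = 2.0                         # linear rope_scale > 1.0

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,
    device_map="auto",
    # Omit this override if the checkpoint's config.json already sets it.
    rope_scaling={"type": "linear", "factor": ROPE_FACTOR},
)
model.eval()

# Tokenize the whole eval text once, then score it in non-overlapping chunks.
text = open("eval_text.txt").read()       # placeholder eval set
ids = tokenizer(text, return_tensors="pt").input_ids[0]

nll, n_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, ids.numel() - 1, CTX_LEN):
        # Take CTX_LEN + 1 tokens so each chunk supplies its own targets.
        chunk = ids[start:start + CTX_LEN + 1].unsqueeze(0).to(model.device)
        if chunk.shape[1] < 2:
            break
        logits = model(chunk[:, :-1]).logits
        targets = chunk[:, 1:]
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)).float(),
            targets.reshape(-1),
            reduction="sum",
        )
        nll += loss.item()
        n_tokens += targets.numel()

print(f"Perplexity @ {CTX_LEN} ctx: {math.exp(nll / n_tokens):.3f}")
```

This is just the sum of per-token negative log-likelihoods exponentiated over the token count, so the numbers should be roughly comparable to what the quantization logs and the Web UI report at the same context length.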
Just wanted to add: since I made this post last year I'm pretty sure I've evaluated several long-context models and gotten reasonable PPL, so possibly this was fixed at some point.