turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

test_inference.py PPL evaluation for >4K context #68

Closed: grimulkan closed this issue 7 months ago

grimulkan commented 1 year ago

Anyone else see this issue when using test_inference.py to compute perplexity for any Llama-2 based model that uses rope_scale > 1.0 and context > 4096?

This happens with GPTQ, EXL2, and fp16 70B models alike (perplexity > 1000). Sometimes I get decent PPL below 4K context, but sometimes even that seems suspiciously high.

However, if I look at the EXL2 quantization logs, both the measurement and final perplexities seem reasonable (a 3.x number at max context length). If I use Exllamav2_hf in oobabooga to measure perplexity in the web UI for those same models, it works fine and I get similar 3.x numbers at max context length.

It would be nice to have a stand-alone way to measure perplexity in test_inference.py.
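For reference, here is a minimal sketch of a standalone perplexity calculation that can be used as a sanity check against test_inference.py's numbers. The `forward_logits` callable is a hypothetical placeholder for whatever forward pass you run (ExLlamaV2, HF transformers, etc.); the chunking and loss math are plain PyTorch and are not taken from the library's own evaluation code.

```python
# Minimal standalone perplexity sketch (assumption: not exllamav2's actual eval code).
# forward_logits is a hypothetical stand-in: it takes (1, n) token ids and returns
# (1, n, vocab_size) logits from whatever model backend you are testing.
import torch
import torch.nn.functional as F

def perplexity(token_ids: torch.Tensor, forward_logits, window: int = 4096) -> float:
    """token_ids: LongTensor of shape (1, seq_len). Returns exp(mean next-token NLL)."""
    nll_sum, count = 0.0, 0
    for start in range(0, token_ids.shape[1] - 1, window):
        chunk = token_ids[:, start : start + window + 1]   # one extra token as the final target
        logits = forward_logits(chunk[:, :-1])              # (1, chunk_len - 1, vocab_size)
        targets = chunk[:, 1:]                               # next-token targets
        nll = F.cross_entropy(
            logits.reshape(-1, logits.shape[-1]).float(),
            targets.reshape(-1),
            reduction="sum",
        )
        nll_sum += nll.item()
        count += targets.numel()
    return float(torch.exp(torch.tensor(nll_sum / count)))
```

This uses the same naive non-overlapping windowing that most evaluation scripts use, so if the model and rope scaling are configured correctly it should land in the same 3.x ballpark as the quantization logs and the web UI.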

grimulkan commented 7 months ago

Just wanted to add, since I made this post last year: I'm pretty sure I've since evaluated several long-context models at reasonable PPL, so this was possibly fixed at some point.