mlc-ai / llm-perf-bench


Perplexity and memory use comparisons would be useful #9

Open JohannesGaessler opened 1 year ago

JohannesGaessler commented 1 year ago

Currently the README does not necessarily provide a like-for-like comparison, because 4-bit quantizations can vary in quality depending on the implementation details. For example, in llama.cpp q4_0 is faster than q4_K_M, but its quantization format is less efficient in terms of size. So it would be useful to include measurements of memory usage as well as a measure of output quality (e.g. perplexity on a large corpus of text) to put the speed numbers into context.
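For reference, the perplexity over a tokenized corpus $x_1, \dots, x_N$ is the exponentiated average negative log-likelihood of each token given its context:

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)
$$

Lower is better; a lossier quantization shows up as a higher value than the FP16 baseline on the same corpus.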

junrushao commented 1 year ago

Will do

shwu-nyunai commented 7 months ago

+1 on perplexity. Any timeline on this? Thanks.

JohannesGaessler commented 7 months ago

I don't know about the timeline, but by now llama.cpp has support for calculating the KL divergence relative to FP16; see https://github.com/ggerganov/llama.cpp/pull/5076. This would be a better metric for comparison than perplexity.
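For anyone wanting to reproduce this metric outside llama.cpp, here is a minimal sketch of the per-token KL divergence against an FP16 reference, assuming you can export pre-softmax logits from both models for the same token positions (the array names and shapes below are illustrative, not part of any particular API):

```python
import numpy as np

def token_kl_divergence(logits_fp16: np.ndarray, logits_quant: np.ndarray) -> np.ndarray:
    """KL(p_fp16 || p_quant) per token position.

    Both inputs are float arrays of shape (num_tokens, vocab_size) holding
    pre-softmax logits for the same text under the FP16 reference model and
    the quantized model. Returns an array of shape (num_tokens,).
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)  # subtract max for numerical stability
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    logp = log_softmax(logits_fp16)   # log p (FP16 reference)
    logq = log_softmax(logits_quant)  # log q (quantized model)
    p = np.exp(logp)
    return (p * (logp - logq)).sum(axis=-1)

# The mean over the corpus is the number usually reported:
# mean_kld = token_kl_divergence(ref_logits, quant_logits).mean()
```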

shwu-nyunai commented 7 months ago

Sure. I am using perplexity scores for a paper, so I need the ppl values. Also, how would I go about actually calculating the scores? I doubt I'll be able to run the llama.cpp perplexity script directly on MLC models. I have been trying to find a way to change mlc_chat, but no progress so far.

If you have the scripts you used on MLC-quantized models, that would be a great help.

I'm also trying to capture the generated logits for a prompt input, but no luck so far.
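In case it helps while the logits capture is being sorted out: once you can get per-position logits for a full prompt out of any engine (teacher-forcing the whole token sequence), perplexity is just the exponentiated mean cross-entropy of each next token. A minimal sketch, with `get_logits` standing in for whatever hook eventually exposes the logits (it is hypothetical, not an existing mlc_chat function):

```python
import numpy as np

def perplexity_from_logits(logits: np.ndarray, token_ids: list[int]) -> float:
    """Perplexity of a token sequence given per-position logits.

    `logits` has shape (len(token_ids), vocab_size): logits[i] is the model's
    prediction after consuming token_ids[:i+1]. Row i therefore scores
    token_ids[i+1], so the last row goes unused.
    """
    x = logits - logits.max(axis=-1, keepdims=True)                # numerical stability
    log_probs = x - np.log(np.exp(x).sum(axis=-1, keepdims=True))  # log-softmax
    targets = np.asarray(token_ids[1:])
    nll = -log_probs[np.arange(len(targets)), targets]             # per-token NLL
    return float(np.exp(nll.mean()))

# Hypothetical usage once logits can be captured from the runtime:
# logits = get_logits(token_ids)   # shape (len(token_ids), vocab_size)
# print(perplexity_from_logits(logits, token_ids))
```

To stay comparable with llama.cpp numbers you would also want to match its evaluation setup (same text, same context length, same chunking), since perplexity is sensitive to all three.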