microsoft / T-MAC

Low-bit LLM inference on CPU with lookup table
MIT License

The perplexity tool returns abnormal values #70

Open ppp-max opened 1 week ago

ppp-max commented 1 week ago

Hello, sorry to bother you.

I tested the PPLs of llama.cpp and T-MAC and got abnormal values: 110682 and 53515, which are far too large. As we know, the normal value should be quite small. I then tested the latest llama.cpp (https://github.com/ggerganov/llama.cpp), whose PPL is about 6~9, which is normal.

(For reference: https://github.com/ggerganov/llama.cpp/blob/master/examples/perplexity/README.md)

Have you tested the PPL yourselves, or does the PPL output need additional processing?

Thank you for your assistance!

QingtaoLi1 commented 1 week ago

@ppp-max Which models are you testing? And did you check with llama-cli whether the output tokens are normal?

Recently, we found that some EfficientQAT Llama-2-7b models have vocab_size=32001, while meta/Llama-2-7b has vocab_size=32000; this mismatch makes the perplexity abnormally high. After hacking it to force 32000 (removing the last token), we got correct PPL numbers. You can see our PR to llama.cpp for the numbers.
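If you want to verify the vocab size stored in a GGUF file yourself, a quick check with ggml's gguf API looks roughly like the sketch below (against a mid-2024 llama.cpp tree; the header location and exact signatures of gguf_init_from_file / gguf_find_key / gguf_get_arr_n may differ in your checkout):

```cpp
// Sketch: print how many tokens a GGUF file stores, using ggml's gguf API.
// (Assumes the mid-2024 llama.cpp tree where this API lives in ggml.h; the
// exact header and signatures may differ in your version.)
#include <cstdio>
#include "ggml.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to read %s\n", argv[1]);
        return 1;
    }

    const int key = gguf_find_key(ctx, "tokenizer.ggml.tokens");
    if (key >= 0) {
        // 32001 here would indicate the EfficientQAT vocab-size mismatch.
        printf("tokenizer.ggml.tokens has %d entries\n", (int) gguf_get_arr_n(ctx, key));
    } else {
        printf("no tokenizer.ggml.tokens key found\n");
    }

    gguf_free(ctx);
    return 0;
}
```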


ppp-max commented 1 week ago

The models I used are llama-2-7b-chat.Q4_0.gguf and llama-2-7b-chat.Q2_K.gguf, downloaded from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF. And I got different PPL values when testing the same GGUF with different llama.cpp versions (https://github.com/ggerganov/llama.cpp and https://github.com/kaleid-liner/llama.cpp).

Also, how do I hack it to force vocab_size to 32000? Thanks.

QingtaoLi1 commented 1 week ago

> The models I used are llama-2-7b-chat.Q4_0.gguf and llama-2-7b-chat.Q2_K.gguf, downloaded from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF. And I got different PPL values when testing the same GGUF with different llama.cpp versions (https://github.com/ggerganov/llama.cpp and https://github.com/kaleid-liner/llama.cpp).
>
> Also, how do I hack it to force vocab_size to 32000? Thanks.

@ppp-max I think it's better to use the non-chat version of the models to test PPL. In our tests, the chat version gives slightly higher PPL numbers, but still below 10. We've tested a Q4_0 model (downloaded from meta/Llama-2-7b and quantized using llama-quantize), for which the original llama.cpp, the kaleid-liner llama.cpp fork, and T-MAC got almost the same PPL (5.961764, 5.962298, 5.962719).

For the vocab_size problem, have you checked the llama-cli output tokens? If the output is random tokens instead of human-readable sentences, you should probably first check other parts, e.g. the configuration, build, and command options. If the generated tokens are normal, you can check model.vocab.n_vocab, model.hparams.n_vocab, or the weight tensor shapes after loading the model to see whether the problem is indeed vocab_size.
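For a quick check from the outside, the public llama.h API can also report the vocab size after loading a model. A minimal sketch (API names shift between llama.cpp versions; this follows a mid-2024 llama.h where llama_load_model_from_file and llama_n_vocab take these forms):

```cpp
// Sketch: load a GGUF model and print its vocab size via the public llama.h API.
// (Follows a mid-2024 llama.h; adjust the calls if your version renamed them.)
#include <cstdio>
#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true;   // loading metadata/vocab is enough for this check

    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (!model) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    // Expect 32000 for meta/Llama-2-7b; 32001 points to the mismatch above.
    printf("n_vocab = %d\n", llama_n_vocab(model));

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```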

QingtaoLi1 commented 3 days ago

@ppp-max I notice that your issue https://github.com/microsoft/T-MAC/issues/61 mentions that you used Llama-2-7b-EfficientQAT-w2g128-GPTQ and Llama-2-7b-EfficientQAT-w4g128-GPTQ. Those are the models where I found the vocab_size problem.

My hack is quite tricky and temporary, so to be honest I don't want to commit it here, but you can use it as a temporary workaround like I do. I forcibly set model.hparams.n_vocab and model.vocab.n_vocab to 32000 after loading the model hparams and vocab, and resize model.vocab.id_to_token to 32000. Then, when reading the tensor info in ggml.c, I change the tensor shape: `if (info->ne[j] == 32001) { info->ne[j] = 32000; }`
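Put together, the changes look roughly like this (a sketch against the llama.cpp/ggml sources I was patching, not a drop-in patch; the exact hook points and member names may differ in your checkout):

```cpp
// Sketch of the temporary workaround above. Where exactly to hook (the end of
// the hparams/vocab loading code in llama.cpp and the tensor-info reading loop
// in ggml.c's GGUF loader) depends on your llama.cpp version.

// 1) In llama.cpp, right after the model hparams and vocab are loaded,
//    clamp the vocab size and drop the extra trailing token:
model.hparams.n_vocab = 32000;
model.vocab.n_vocab   = 32000;
model.vocab.id_to_token.resize(32000);

// 2) In ggml.c, where the GGUF tensor info is read, clamp the offending
//    dimension so the weight tensor shapes match the 32000-entry vocab:
for (uint32_t j = 0; j < info->n_dims; ++j) {
    if (info->ne[j] == 32001) {
        info->ne[j] = 32000;
    }
}
```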

Hope these can help you.