halixness opened this issue 4 months ago
Perplexity is a measure that depends on the model used to calculate it: the formula is defined entirely in terms of the probabilities a given model assigns.
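For reference, the standard definition for a tokenized sequence $x_1, \dots, x_N$:

```
PPL(X) = exp( -(1/N) * sum_{i=1..N} log p_theta(x_i | x_{<i}) )
```

Since $p_\theta$ is the model's own conditional distribution, two models assign different probabilities, and hence different perplexities, to the same text.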
So seeing different perplexities for different models is entirely expected.
You could even argue that GPT-2's higher perplexity on the human-written wikitext data, compared to Llama-2-7B's, simply reflects that Llama-2-7B is the better model.
I also tested LLaMa2-70b, and its perplexity on wikitext is around 22. Shouldn't I expect better performance from both the 70b and the 7b variants? Is LLaMa2's training distribution really that far from wikitext?
I don't know how the perplexity in the original paper was calculated, whether with HuggingFace's metric or not, but since this discussion centers on the varying window size, that could be part of the problem.
Are you confident that you are using the right sliding-window (context window) size?
If you want to be really precise, you could write your own perplexity measure. There is a good guide here: HuggingFace's "Perplexity of Fixed-Length Models".
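A minimal sketch of the strided approach from that guide, assuming a causal LM and tokenizer from `transformers`; the `max_length` and `stride` values here are illustrative, not the paper's settings:

```python
# Sliding-window (strided) perplexity, following the HuggingFace
# "Perplexity of Fixed-Length Models" guide. Tokens that serve only as
# context in a window are masked with -100 so they don't enter the loss.
import torch

def sliding_window_perplexity(model, tokenizer, text, max_length=1024, stride=512):
    encodings = tokenizer(text, return_tensors="pt")
    seq_len = encodings.input_ids.size(1)
    nlls = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # tokens scored for the first time in this window
        input_ids = encodings.input_ids[:, begin:end]
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # ignore pure-context tokens in the loss
        with torch.no_grad():
            loss = model(input_ids, labels=target_ids).loss
        nlls.append(loss * trg_len)  # loss is a per-token average; re-weight it
        prev_end = end
        if end == seq_len:
            break
    # exponentiated average negative log-likelihood over all scored tokens
    return torch.exp(torch.stack(nlls).sum() / prev_end)
```

A larger stride is faster but gives each scored token less context, which inflates perplexity; that is exactly the window-size sensitivity being discussed here.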
Also, it is not surprising that 70b is largely outperforming 7b. The parameter difference is just that large.
Hi @halixness, were you able to resolve the perplexity issue? I am also getting a similar value (~56) for llama2-7b. I have tried coding up the perplexity calculation suggested by @SamSJackson, and also the huggingface evaluate module, but I still get similar values.
Hello. I'm struggling to replicate the reported perplexity (~6) for LLaMa-2-7b. I am using this simple code snippet:
Among the results, I get:

```
'mean_perplexity': 60.9764459149642
```

In this tutorial it is computed "approximately" by flattening the dataset into a single string and averaging the sliding-window perplexity, and I still get a high perplexity. When I change the model in the snippet to `openai-community/gpt2`, the perplexity is above 600! Could this depend on using the correct model class? Thank you for any suggestions.

EDIT: I'm using the following versions