huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
https://huggingface.co/docs/evaluate

Is perplexity correctly computed? #560

Open halixness opened 4 months ago

halixness commented 4 months ago

Hello. I'm struggling with replicating the reported perplexity (~6) for LLaMa-2-7b. I am using this simple code snippet:

import evaluate
import datasets

# Load the perplexity metric from the evaluate hub
perplexity = evaluate.load("perplexity", module_type="metric")

# WikiText-2 test split; drop empty lines, which cannot be scored
input_texts = datasets.load_dataset("wikitext",
                                    "wikitext-2-raw-v1",
                                    split="test")["text"]
input_texts = [s for s in input_texts if s != '']

results = perplexity.compute(
    model_id="sharpbai/Llama-2-7b-hf",
    batch_size=4,
    predictions=input_texts
)
print(results)

And I get, among the results, 'mean_perplexity': 60.9764459149642. In this tutorial it is computed "approximately" by flattening the dataset into a single string and taking the average sliding-window perplexity, and I still get a high value that way. If I change the model in the snippet to openai-community/gpt2, the perplexity is above 600! Does this depend on using the correct model class? Thank you for any suggestion.

EDIT: I'm using the following versions

transformers              4.38.2
evaluate                  0.4.1
datasets                  2.18.0

SamSJackson commented 4 months ago

Perplexity is a measure that depends on the model used to calculate it. Specifically, the formula for perplexity is built entirely from the probabilities that the given model assigns to the tokens.
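For reference, the standard definition (as in the Hugging Face perplexity guide) for a tokenized sequence $X = (x_1, \dots, x_t)$ is

$$\mathrm{PPL}(X) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t}\log p_\theta(x_i \mid x_{<i})\right)$$

where $p_\theta(x_i \mid x_{<i})$ is the probability the model assigns to token $x_i$ given the preceding tokens, so the same text yields different values under different models (and tokenizers).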

So seeing different perplexities for different models is entirely expected.

You could even argue that GPT-2's higher perplexity on the human-written WikiText data, compared to Llama-2-7B, is a reflection that Llama-2-7B is the better model.

halixness commented 4 months ago

I also tested LLaMa2-70b, and its perplexity on wikitext is around 22. Shouldn't I expect better performance from both the 70b and the 7b variants? Is LLaMa2's training distribution really that far from WikiText?

SamSJackson commented 4 months ago

I don't know how the perplexity in the original paper was calculated, or whether they used Hugging Face's metric, but since the discussion hinges on the sliding-window size, that could be part of the problem.

Are you confident that you are using the right sliding-window (context window) size?

If you want to be really precise, you could write your own perplexity measure. There is a good guide here: HuggingFace: Perplexity of Fixed-Length Models.
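For reference, here is a minimal sketch of that fixed-length / sliding-window approach, adapted from that guide; the model_id is a placeholder (GPT-2 for speed), and stride / max_length are the knobs to experiment with:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai-community/gpt2"   # placeholder; swap in the model you are evaluating
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)
model.eval()

# Join the whole test split so sentences share context, as in the guide
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length = model.config.max_position_embeddings  # the model's context window
stride = 512                                       # how far the window slides each step
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end = 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end            # number of new tokens scored in this window
    input_ids = encodings.input_ids[:, begin:end].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100     # mask the overlapping context so it is not re-scored

    with torch.no_grad():
        out = model(input_ids, labels=target_ids)
        nlls.append(out.loss)           # mean NLL over the tokens scored in this window

    prev_end = end
    if end == seq_len:
        break

# Approximate perplexity: exponential of the average window NLL
ppl = torch.exp(torch.stack(nlls).mean())
print(ppl.item())

A smaller stride gives each scored token more preceding context (and usually a lower perplexity) at the cost of more forward passes, so that choice alone can move the number noticeably.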

Also, it is not surprising that the 70b largely outperforms the 7b; the parameter difference is just that large.

anu7699 commented 2 months ago

Hi @halixness, were you able to resolve the perplexity issue? I am also getting a similar value (~56) for llama2-7b. I have tried coding up the perplexity calculation suggested by @SamSJackson, and also the huggingface evaluate module, but I still get similar values.