rbroc / echo

A Scalable and Explainable Approach to Discriminating Between Human and Artificially Generated Text
https://cc.au.dk/en/clai/current-projects/a-scalable-and-explainable-approach-to-discriminating-between-human-and-artificially-generated-text

Computing Perplexity outside of TextDescriptives (and Entropy) #62

Open MinaAlmasi opened 1 month ago

MinaAlmasi commented 1 month ago

Discussed with Yuri today (7/08/24). From the April meeting notes (23/04/24), I noted that we were considering computing perplexity with HF's evaluate library, using a baseline model such as GPT-2 to serve as an "oracle" for perplexity.

Some notes for future meetings:

Picking a baseline model & general thoughts about interpretation of the metric

This approach entails that the perplexity scores will change depending on the baseline model. For instance, a model that has seen much more data than GPT-2 may produce lower perplexity scores than GPT-2 for the same text.

Therefore the interpretation would not be about whether the text has high or low perplexity in general, but rather whether the models (and humans) have higher or lower perplexity relative to each other.

With that said, I'm still unsure how much the choice of model matters (should I just run with GPT-2?).

Plan for Entropy

Planning to compute entropy by simply taking the log of perplexity, given that the two are directly related. Formula here.

Note that the formula in the link above expresses perplexity as $\text{Perplexity}(X)=2^{H(X)}$, but the HF readme explains that perplexity "is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base e" (see link). Therefore I'm taking np.log(perplexity) to compute entropy rather than np.log2(perplexity).
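A quick sketch (with made-up per-token log-likelihoods) of why the base matters: since HF exponentiates with base e, np.log inverts it exactly and returns the entropy in nats, whereas np.log2 would return the same quantity in bits.

```python
import numpy as np

# Hypothetical per-token log-likelihoods (base e) for a short sequence
log_likelihoods = np.array([-2.3, -1.7, -3.1, -2.0])

# HF's perplexity: exponentiated average negative log-likelihood, base e
avg_nll = -log_likelihoods.mean()   # this is the entropy H, in nats
perplexity = np.exp(avg_nll)

entropy_nats = np.log(perplexity)   # inverts np.exp exactly
entropy_bits = np.log2(perplexity)  # same quantity in bits, i.e. H / ln(2)

assert np.isclose(entropy_nats, avg_nll)
assert np.isclose(entropy_bits, avg_nll / np.log(2))
```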

But which units are these perplexities and entropy scores in?
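A partial answer, sketched with a hypothetical uniform model: since the base is e, the entropy comes out in nats, and perplexity is a dimensionless "effective vocabulary size" (how many tokens the model is effectively choosing among at each step).

```python
import numpy as np

V = 50257  # GPT-2's vocabulary size

# Under a uniform model, every token has probability 1/V, so the average
# negative log-likelihood (= entropy, in nats) is ln(V) ...
entropy = np.log(V)

# ... and the perplexity is exp(ln(V)) = V: the model behaves as if it
# were choosing uniformly among V tokens at every step.
perplexity = np.exp(entropy)

assert np.isclose(perplexity, V)
```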

MinaAlmasi commented 1 month ago

Update: the manually computed perplexities are very different from those of textdescriptives, even when both use GPT-2. I guess GPT-2 is more perplexed than the spaCy metrics?

(Screenshot, 2024-08-14: comparison of the manually computed perplexity scores and the textdescriptives scores.)
code

```python
import numpy as np
from evaluate import load


def convert_to_entropy(perplexity: float):
    '''
    Compute entropy from perplexity. Since HF's perplexity is "defined as the
    exponentiated average negative log-likelihood of a sequence, calculated
    with exponent base `e`", we just take log(perplexity) to get entropy.
    '''
    return np.log(perplexity)


def compute_perplexity(texts: list, model_id: str = "gpt2", batch_size: int = 1):
    '''
    Compute perplexity. This perplexity "is defined as the exponentiated
    average negative log-likelihood of a sequence, calculated with exponent
    base `e`."
    source: https://huggingface.co/spaces/evaluate-measurement/perplexity/blob/main/README.md

    Args:
        texts: list of texts
        model_id: model id
        batch_size: batch size for processing
    '''
    perplexity = load("perplexity", module_type="metric")

    perplexity_scores = perplexity.compute(
        predictions=texts,
        model_id=model_id,
        add_start_token=True,  # default; needed to compute perplexity of the first token
                               # see: https://github.com/huggingface/evaluate/blob/main/metrics/perplexity/perplexity.py
        batch_size=batch_size,
    )

    return perplexity_scores
```