Discussed with Yuri today (7/08/24). From the meeting notes in April (23/04/24), I noted down that we were considering computing perplexity using HF's evaluate library, using a baseline model like GPT-2 to serve as an "oracle" for perplexity.
Some notes for future meetings:
Picking a baseline model & general thoughts about interpretation of the metric
The approach entails that the perplexity will change depending on the baseline model: for instance, a model that has seen much more data than GPT-2 may produce lower perplexity scores than GPT-2 for the same text.
Therefore the interpretation would not be about whether the text has high or low perplexity in general, but rather whether the models (and humans) have higher or lower perplexity relative to each other.
With that being said, I'm still unsure about the importance of choosing a model (should I just run with GPT-2?).
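To make the baseline dependence concrete, here is a minimal sketch using HF's evaluate perplexity metric (the sample texts are placeholders, and "gpt2-large" is only an illustrative second baseline, not a decided choice). The absolute numbers will differ per baseline, which is why only the relative comparison is meaningful.

```python
from evaluate import load

# Placeholder texts; in practice these would be the human- and model-generated texts.
texts = [
    "The cat sat on the mat.",
    "Colourless green ideas sleep furiously.",
]

perplexity = load("perplexity", module_type="metric")

# "gpt2-large" is only an illustrative second baseline, not a decided choice.
for model_id in ["gpt2", "gpt2-large"]:
    scores = perplexity.compute(predictions=texts, model_id=model_id, add_start_token=True)
    print(model_id, scores["mean_perplexity"])  # absolute values depend on the baseline model
```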
Plan for Entropy
Planning to compute entropy simply by taking the log of the perplexity, given that the two are directly related. Formula here.
Note that the formula in the link above expresses perplexity as $\text{Perplexity}(X) = 2^{H(X)}$, but the HF readme explains that the perplexity "is defined as the exponentiated average negative log-likelihood of a sequence, calculated with exponent base `e`." (see link). Therefore I'm taking np.log(perplexity) to compute entropy rather than np.log2(perplexity).
But which units are these perplexities and entropy scores in?
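As a quick sanity check of the base-e reasoning above (a sketch with made-up per-token log-probabilities, not real model output): since HF's perplexity is the exponentiated mean negative log-likelihood, np.log recovers the mean NLL (entropy in nats), whereas np.log2 of the same perplexity would express it in bits.

```python
import numpy as np

# Made-up per-token log-probabilities (natural log), purely illustrative.
token_logprobs = np.array([-2.3, -0.7, -1.5, -3.1])

mean_nll = -token_logprobs.mean()  # average negative log-likelihood (nats per token)
ppl = np.exp(mean_nll)             # HF-style perplexity: exponent base e

print(np.isclose(np.log(ppl), mean_nll))  # True: ln(perplexity) == entropy in nats
print(np.log2(ppl))                       # the same quantity in bits (= mean_nll / ln 2)
```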
Update: the manually computed perplexities are quite different from those of textdescriptives when using GPT-2. I guess GPT-2 is more perplexed than the spaCy-based metrics?
Code
```python
import numpy as np
from evaluate import load


def convert_to_entropy(perplexity: float) -> float:
    '''
    Compute entropy from perplexity.

    Since HF's perplexity is "defined as the exponentiated average negative
    log-likelihood of a sequence, calculated with exponent base `e`",
    taking the natural log of the perplexity recovers the entropy.
    '''
    return np.log(perplexity)


def compute_perplexity(texts: list, model_id: str = "gpt2", batch_size: int = 1):
    '''
    Compute perplexity with HF's evaluate library.

    This perplexity "is defined as the exponentiated average negative
    log-likelihood of a sequence, calculated with exponent base `e`."
    Source: https://huggingface.co/spaces/evaluate-measurement/perplexity/blob/main/README.md

    Args:
        texts: list of texts
        model_id: HF model id used as the baseline model
        batch_size: batch size for processing
    '''
    perplexity = load("perplexity", module_type="metric")
    perplexity_scores = perplexity.compute(
        predictions=texts,
        model_id=model_id,
        # default; adds a start token so the perplexity of the first token can
        # be computed, see:
        # https://github.com/huggingface/evaluate/blob/main/metrics/perplexity/perplexity.py
        add_start_token=True,
        batch_size=batch_size,
    )
    return perplexity_scores
```
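A usage sketch for the two helpers above (the texts are placeholders; as far as I can tell, perplexity.compute returns a dict with per-text "perplexities" and a "mean_perplexity"):

```python
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Perplexity measures how surprised a model is by a text.",
]

scores = compute_perplexity(texts, model_id="gpt2", batch_size=2)
entropies = [convert_to_entropy(p) for p in scores["perplexities"]]

print("mean perplexity:", scores["mean_perplexity"])
print("per-text entropies (nats):", entropies)
```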