Closed · sathvikn closed 3 months ago
Name | Link |
---|---|
Latest commit | e3c568df6ded7ada6065da7b1beb93b327383f0e |
Latest deploy log | https://app.netlify.com/sites/pyminicons/deploys/66a353fdf154ec0008648fa3 |
I could also add a warning if a sentence fails to get aggregated properly and just return the output of `token_score`, so this fails gracefully.
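A minimal sketch of that graceful-fallback idea, assuming a hypothetical aggregation helper (`aggregate_words` and the function name below are illustrative, not minicons' actual API):

```python
import warnings

def word_score_with_fallback(sentence, token_scores, aggregate_words):
    """Try to aggregate token-level surprisals into word-level ones;
    on failure, warn and return the token-level output unchanged.
    Illustrative sketch, not the actual minicons implementation."""
    try:
        return aggregate_words(sentence, token_scores)
    except ValueError as err:
        warnings.warn(
            f"Could not aggregate token scores for {sentence!r} ({err}); "
            "returning token-level scores instead."
        )
        return token_scores
```

The caller still gets usable (token-level) output on failure, and the warning makes the degraded behavior visible rather than silent.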
As discussed, I implemented the warning and checked whether a token is the beginning-of-sentence token. I tested this with both Llama and GPT2.
The current implementation sums surprisals over individual subword tokens with a new method called `word_score`, applicable to all `LMScorer` objects. It currently splits sentences on whitespace and punctuation; we might need to add workarounds for special characters. This doesn't implement the fixes suggested in https://github.com/tpimentelms/probability-of-a-word/tree/main, but I thought this PR could serve as a starting point for word-level measures.

Testing: I ran it with an example that splits the text into multiple subword tokens (confirmed for GPT2 and RoBERTa).
Please let me know if you have any further suggestions/changes. Thanks in advance!