kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Score interpretation #150

Closed numericlee closed 6 years ago

numericlee commented 6 years ago

New to KenLM and NLP

Thank you for this excellent tool

Struggling with the interpretation of model.score. The model was trained on the Brown Corpus with trigrams (o=3). The first phrase below is verbatim from the corpus and gets a score of -57.34.

Does that mean the probability the phrase follows the language model is 1 - 10^-57 or 1 - e^-57?

If so, it would suggest that the probability associated with all the other phrases is still well above 90%.

Or is the score just a directional gauge? (I have another question I will submit separately.)

```python
>>> import kenlm
>>> model = kenlm.LanguageModel('FMQ/brown.klm')
>>> model.score('The jury said it found the court "has incorporated into its operating procedures the recommendations" of two previous grand juries, the Atlanta Bar Association and an interim citizens committee.')
-57.3411865234375
>>> model.score("James Earl Carter Jr. (born October 1, 1924) is an American politician who served as the 39th president of the United States from 1977 to 1981")
-53.679893493652344
>>> model.score('He estado en Estados Unidos durante 2 años')
-24.930564880371094
>>> model.score('xyxyx 123 pastrami bacon')
-18.20257568359375
>>> model.score("1234 Main Street, Pittsburgh, PA 15207")
-10.831670761108398
>>> model.score('Supercalifragilisticexpialidocious')
-6.057686805725098
```
kpu commented 6 years ago

They are log10 probabilities. So the probability of generating that sentence from the model is 10^-57.34.
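To make that concrete, here is a small sketch (plain Python, no kenlm needed) of how one might convert a log10 score into a raw probability, and into a per-token perplexity so that sentences of different lengths can be compared more fairly. The token count of 30 below is a hypothetical value, not taken from the thread.

```python
import math

def log10_to_prob(score):
    # kenlm scores are log10 probabilities, so the raw
    # probability of the whole sentence is 10**score.
    return 10.0 ** score

def perplexity(score, n_tokens):
    # Per-token perplexity: 10 ** (-score / n_tokens).
    # Lower is better; this normalizes away sentence length.
    return 10.0 ** (-score / n_tokens)

prob = log10_to_prob(-57.34)   # an astronomically small number, as expected
ppl = perplexity(-57.34, 30)   # assuming roughly 30 tokens in the sentence
```

This also answers why longer sentences get more negative scores: each additional token multiplies in another probability below 1, so comparing raw scores across sentences of very different lengths is misleading.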

You should probably run a tokenizer such as https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl over the text rather than scoring raw strings, and also consider lowercasing, so that the input matches the preprocessing the model was trained with.
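As a rough illustration of the point, here is a minimal preprocessing sketch. It is not a substitute for the Moses tokenizer linked above (which handles many more cases); it only shows the general idea of lowercasing and separating punctuation so the tokens match what the model saw during training.

```python
import re

def preprocess(text):
    # Minimal stand-in for a real tokenizer: lowercase and split
    # common punctuation off the surrounding words. Production use
    # should apply the exact tokenizer the model was trained with.
    text = text.lower()
    text = re.sub(r'([.,!?;:"()])', r' \1 ', text)
    # Collapse the extra spaces introduced above.
    return ' '.join(text.split())

# e.g. preprocess('The jury said it found the court "has incorporated...')
# yields lowercased text with punctuation as separate tokens.
```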