Closed numericlee closed 6 years ago
They are log10 probabilities. So the probability of generating that sentence from the model is 10^-57.34.
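A minimal sketch of what that score means, using only the arithmetic (the score value is the one from the question; no KenLM install is needed for the conversion itself):

```python
import math

# model.score returns a log10 probability, so a score of -57.34
# corresponds to a raw sentence probability of 10 ** -57.34.
score = -57.34
prob = 10 ** score
print(prob)  # on the order of 1e-58 -- tiny, as expected for a whole sentence

# Scores are mainly useful for *comparing* sentences: the difference
# between two scores is the log10 of their probability ratio.
score_a, score_b = -57.34, -60.00
ratio = 10 ** (score_a - score_b)
print(ratio)  # sentence A is a few hundred times more likely than B
```

Raw sentence probabilities are always minuscule because they are products of many per-word probabilities; that is why scores are interpreted relative to each other (or normalized into perplexity), not as absolute likelihoods.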
You should probably run your text through a tokenizer such as https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl rather than scoring raw text, and also consider lowercasing.
New to KenLM and NLP
Thank you for this excellent tool
I'm struggling with the interpretation of model.score. My model is trained on the Brown Corpus with trigrams (o=3). The first phrase is verbatim from the corpus and gets a score of -57.34.
Does that mean the probability that the phrase follows the language model is 1 - 10^-57 or 1 - e^-57?
If so, it would suggest that the probability associated with all the other phrases is still well above 90%.
Or is the score just a directional gauge? (I have another question I will submit separately.)