smilli / berkeleylm

Automatically exported from code.google.com/p/berkeleylm

Are the log probabilities comparable across language models? #15

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
I am training multiple language models with Kneser-Ney smoothing on different corpora, and then classifying new sentences by scoring them with each language model and taking the highest score (Naive Bayes).

Does this work with this library's Kneser-Ney smoothing? That is, are the distributions properly normalized so that I can compare scores across language models?
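
A minimal sketch of the setup described above (the topic names and binary file names are hypothetical; it uses LmReaders.readLmBinary and NgramLanguageModel.scoreSentence as I understand the library's API):

```java
import java.util.Arrays;
import java.util.List;

import edu.berkeley.nlp.lm.NgramLanguageModel;
import edu.berkeley.nlp.lm.io.LmReaders;

public class LmNaiveBayes {
    public static void main(String[] args) {
        // One Kneser-Ney model per class, each trained on its own corpus
        // and saved as a binary (file names here are hypothetical).
        List<String> topics = Arrays.asList("sports", "politics", "science");
        List<String> sentence = Arrays.asList("the", "match", "went", "to", "overtime");

        String bestTopic = null;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (String topic : topics) {
            NgramLanguageModel<String> lm = LmReaders.readLmBinary(topic + ".binary");
            // scoreSentence sums the n-gram log probabilities of the
            // sentence under this model.
            float score = lm.scoreSentence(sentence);
            if (score > bestScore) {
                bestScore = score;
                bestTopic = topic;
            }
        }
        System.out.println("Predicted topic: " + bestTopic);
    }
}
```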

Original issue reported on code.google.com by b...@parakhi.com on 18 Jul 2013 at 12:45

GoogleCodeExporter commented 9 years ago
Yes, that should work fine.

Original comment by adpa...@google.com on 18 Jul 2013 at 2:48

GoogleCodeExporter commented 9 years ago
I am seeing behavior that confuses me. For instance, language models for topics with very small vocabularies assign very high log probs to sentences of words they have never seen, higher than the scores from language models that have seen those words.
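
To make the comparison concrete, the kind of check behind this observation would look roughly like this (a hypothetical sketch; the file names are made up, and it assumes the same LmReaders/scoreSentence API as above):

```java
import java.util.Arrays;
import java.util.List;

import edu.berkeley.nlp.lm.NgramLanguageModel;
import edu.berkeley.nlp.lm.io.LmReaders;

public class CompareLmScores {
    public static void main(String[] args) {
        // Hypothetical binaries: one model trained on a corpus with a tiny
        // vocabulary, one on a large corpus that contains the test words.
        NgramLanguageModel<String> smallLm = LmReaders.readLmBinary("small_topic.binary");
        NgramLanguageModel<String> largeLm = LmReaders.readLmBinary("large_topic.binary");

        // Every content word here is out-of-vocabulary for the small model.
        List<String> sentence = Arrays.asList("the", "quarterback", "threw", "an", "interception");

        // Reported behavior: the small-vocabulary model's score comes out
        // higher, even though it has never seen these words.
        System.out.println("small-vocab LM: " + smallLm.scoreSentence(sentence));
        System.out.println("large-vocab LM: " + largeLm.scoreSentence(sentence));
    }
}
```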

Original comment by b...@parakhi.com on 18 Jul 2013 at 3:00

GoogleCodeExporter commented 9 years ago
Could you give me a concrete example? 

Original comment by adpa...@gmail.com on 18 Jul 2013 at 3:29