nltk / nltk

NLTK Source
https://www.nltk.org
Apache License 2.0

Unable to evaluate perplexity and entropy of KneserNeyInterpolated and WittenBellInterpolated Language Model #2727

Open cprachaseree opened 3 years ago

cprachaseree commented 3 years ago

On nltk 3.6.2, in a Google Colab environment, the code

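from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import flatten, pad_both_ends

# n, train_ngrams, vocab and test_corpus are defined earlier (not shown)
lm = KneserNeyInterpolated(n)
lm.fit(train_ngrams, vocab)

test_sentences = list(flatten(pad_both_ends(sent, n=n) for sent in test_corpus))
entropy = lm.entropy(test_sentences)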

gives error:

/usr/local/lib/python3.7/dist-packages/nltk/lm/smoothing.py in alpha_gamma(self, word, context)
     49     def alpha_gamma(self, word, context):
     50         prefix_counts = self.counts[context]
---> 51         prefix_total_ngrams = prefix_counts.N()
     52         alpha = max(prefix_counts[word] - self.discount, 0.0) / prefix_total_ngrams
     53         gamma = (

AttributeError: 'int' object has no attribute 'N'

MLE and Laplace work fine. Thank you.

jayralencar commented 2 years ago

Any solution?

iliakur commented 2 years ago

I think the issue was closed before anyone had time to sit down and investigate.

I'll reopen, but most likely won't have time to investigate it until November.

tomaarsen commented 2 years ago

@iliakur Note that the traceback in the report comes from code that predates #2709. Even so, the issue might persist. The following method will be called with KneserNey: https://github.com/nltk/nltk/blob/fb2e2f21728b4dc0df710a00ec59d0a3797d769e/nltk/lm/smoothing.py#L100-L109

self.counts is an NgramCounter, which is accessed like self.counts[context]. This calls https://github.com/nltk/nltk/blob/fb2e2f21728b4dc0df710a00ec59d0a3797d769e/nltk/lm/counter.py#L144-L151

This returns very different results based on the type of context. If given an integer or a sequence, it will return a FreqDist, on which .N() can be called. However, if context is a string, then it simply returns an integer. I believe this is the cause of the error. After all, you can't call .N() on an int.
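The type-dependent lookup described above can be imitated with a small stand-in. This is a toy sketch, not NLTK's code: `ToyFreqDist` and `ToyNgramCounter` are hypothetical names, only the string and sequence branches are modelled, and the integer branch is left out.

```python
from collections import Counter, defaultdict


class ToyFreqDist(Counter):
    """Hypothetical stand-in for a frequency distribution with an N() method."""

    def N(self):
        # Total number of counted outcomes.
        return sum(self.values())


class ToyNgramCounter:
    """Hypothetical counter whose lookup result depends on the key's type."""

    def __init__(self, ngrams):
        self.unigrams = ToyFreqDist()
        self.by_context = defaultdict(ToyFreqDist)
        for ngram in ngrams:
            if len(ngram) == 1:
                self.unigrams[ngram[0]] += 1
            else:
                self.by_context[tuple(ngram[:-1])][ngram[-1]] += 1

    def __getitem__(self, item):
        if isinstance(item, str):
            # A bare string is treated as a word: the result is a plain int.
            return self.unigrams[item]
        # A sequence of words is treated as a context: the result has .N().
        return self.by_context[tuple(item)]


counts = ToyNgramCounter([("a",), ("b",), ("a", "b"), ("a", "c")])
context_result = counts[("a",)]  # distribution of words seen after "a"
string_result = counts["a"]      # unigram count of "a"

print(type(context_result).__name__, context_result.N())
print(type(string_result).__name__)

try:
    string_result.N()
except AttributeError as err:
    message = str(err)
    print(message)
```

In this toy, the tuple key yields an object that supports .N(), while the string key yields an int, and calling .N() on it raises the same kind of AttributeError as in the report.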

However, I don't know when alpha_gamma is ever called with context as a string rather than a sequence of strings; I haven't been able to find such a case. I also haven't been able to reproduce the exact issue, or I would be able to track this down more easily.
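One way a string context could arise, offered as a guess rather than a confirmed diagnosis: the reported code passes a flattened list of padded words to lm.entropy. If the scoring loop splits each item it receives into `item[-1]` and `item[:-1]` (an assumption about the caller, not verified here), a flat list of words would produce string contexts, while a list of n-gram tuples would produce tuple contexts. The helpers `split_word_and_context` and `sliding_ngrams` are hypothetical; the sketch relies only on standard Python slicing.

```python
def split_word_and_context(item):
    """Hypothetical helper: split an item into (last element, everything before)."""
    return item[-1], item[:-1]


def sliding_ngrams(tokens, n):
    """Build n-gram tuples from a token list with a simple sliding window."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


padded = ["<s>", "the", "cat", "</s>"]

# Flat list of words: each item is a str, so its "context" is also a str.
flat_contexts = [split_word_and_context(tok)[1] for tok in padded]

# List of bigram tuples: each item is a tuple, so its context is a tuple.
bigram_contexts = [
    split_word_and_context(ngram)[1] for ngram in sliding_ngrams(padded, 2)
]

print(flat_contexts)
print(bigram_contexts)
```

If that assumption holds, passing n-gram tuples instead of a flattened word list to the scoring call would avoid the string branch. This has not been checked against the reporter's setup.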

Still, perhaps this comment will help you out in some way, Ilia.