nltk / nltk_book

NLTK Book
http://www.nltk.org/book

Chapter 6, 1.4 code bug #194

Open Liwink opened 7 years ago

Liwink commented 7 years ago
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

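If a word has only two letters, for instance 'of', the suffix 'of' is counted twice, because word[-2:] and word[-3:] both return the whole word. Likewise, a one-letter word is counted three times.

As a result, 'of' ends up with a higher frequency than 'f':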

>>> 'of'[-3:]
'of'
>>> common_suffixes[:10]
['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of']
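To make the cause explicit, here is a minimal sketch of the three slices the original loop takes. The helper name naive_suffixes is hypothetical and only used for illustration; it is not part of NLTK or the book.

```python
# Python clamps out-of-range slice bounds instead of raising an error,
# so a word shorter than three characters yields the same slice more than once.
def naive_suffixes(word):
    """Return the three slices taken by the original loop (hypothetical helper)."""
    return [word[-1:], word[-2:], word[-3:]]

print(naive_suffixes("the"))  # ['e', 'he', 'the']
print(naive_suffixes("of"))   # ['f', 'of', 'of']
print(naive_suffixes("a"))    # ['a', 'a', 'a']
```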

I fixed it by limiting the suffix length to the length of the word:

>>> for word in brown.words():
...     word = word.lower()
...     for i in range(min(len(word), 3)):
...         suffix_fdist[word[-i-1:]] += 1
...
>>> suffix_fdist['f']
43101
>>> suffix_fdist['of']
36566
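The same idea can be checked without downloading the Brown corpus. The sketch below is only an illustration under assumptions: it uses collections.Counter in place of the book's suffix_fdist, and the function name count_suffixes, the max_len parameter, and the toy word list are all made up for this example.

```python
from collections import Counter

def count_suffixes(words, max_len=3):
    """Count each suffix of length 1..max_len once per word (hypothetical helper)."""
    counts = Counter()
    for word in words:
        word = word.lower()
        # Never ask for a suffix longer than the word itself,
        # so short words are not counted more than once per suffix.
        for n in range(1, min(len(word), max_len) + 1):
            counts[word[-n:]] += 1
    return counts

# Toy data standing in for brown.words(); not real corpus counts.
toy_words = ["of", "Of", "if", "the", "a"]
counts = count_suffixes(toy_words)

print(counts["f"])   # 3  ('of', 'Of', 'if')
print(counts["of"])  # 2  ('of', 'Of')
print(counts["a"])   # 1  (counted once, not three times)
```

With this version a suffix can never be counted more often than the shorter suffix it ends with, which matches the corrected corpus counts above, where 'f' is more frequent than 'of'.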