nltk / nltk_book

NLTK Book
http://www.nltk.org/book

Lexical diversity score inverted in 3.2 #208

Open KeithDonoghue opened 6 years ago

KeithDonoghue commented 6 years ago

In Section 3.2 we have:

>>> from __future__ import division
>>> def lexical_diversity(text):
...     return len(text) / len(set(text))

This is followed by the line 'Here's an equivalent definition which does the same work using multiple lines of code.' and:

>>> def lexical_diversity(my_text_data):
...     word_count = len(my_text_data)
...     vocab_size = len(set(my_text_data))
...     diversity_score = vocab_size / word_count
...     return diversity_score

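The division here is inverted: the second definition returns vocab_size / word_count, which is the reciprocal of the first, so the two are not equivalent.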
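For anyone who wants to check this, here is a minimal sketch showing that the two definitions are reciprocals rather than equivalents. It assumes Python 3, where `/` is already true division, so the `from __future__ import division` line is not needed. The function names `tokens_per_type` and `type_token_ratio` and the toy token list are illustrative choices, not taken from the book.

```python
import math


def tokens_per_type(text):
    """First definition from Section 3.2: total words over distinct words."""
    return len(text) / len(set(text))


def type_token_ratio(text):
    """Multi-line definition from Section 3.2: distinct words over total words."""
    word_count = len(text)
    vocab_size = len(set(text))
    return vocab_size / word_count


# Hypothetical toy token list (6 tokens, 5 distinct types), not from the book.
sample = ["the", "cat", "sat", "on", "the", "mat"]

first = tokens_per_type(sample)    # 6 / 5
second = type_token_ratio(sample)  # 5 / 6

print(first, second)

# The two results differ, and multiplying them gives 1 (up to float rounding).
print(math.isclose(first * second, 1.0))
```

The two functions only agree when every token in the text is distinct, in which case both return 1.0.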

tmusgrove1963 commented 6 years ago

As of today, Aug 3, 2018, this error is still there in Chapter 2.

Note that this is wrong:

>>> def lexical_diversity(text):
...     return len(text) / len(set(text))

lexical_diversity is supposed to be the type/token ratio, but the above is the inverse of that.

Note that in Chapter 1, the correct formula is given:

>>> def lexical_diversity(text):
...     return len(set(text)) / len(text)

Hope this can be fixed as it confused a bunch of my students. -T.

databysara commented 5 months ago

Chapter 1, section on Lexical Diversity: the definition of lexical diversity (and hence the calculations that follow) is incorrect in Chapter 1 of the version of this book available on O'Reilly, but it is correct in the GitHub version.