nltk / nltk_book

NLTK Book
http://www.nltk.org/book

Chapt 5 Section 2.5 Verbs section correction #191

Open AncientZygote opened 7 years ago

AncientZygote commented 7 years ago

There is an error in the NLTK Book updated for Python 3 and NLTK 3, Natural Language Processing with Python; Chapter 5. Categorizing and Tagging Words; Section 2.5 Verbs:

"To clarify the distinction between VBD (past tense) and VBN (past participle), let's find words which can be both VBD and VBN, and see some surrounding text:

[w for w in cfd1.conditions() if 'VBD' in cfd1[w] and 'VBN' in cfd1[w]]
['Asked', 'accelerated', 'accepted', 'accused', 'acquired', 'added', 'adopted', ...]"

The list comprehension quoted above returns an empty list, because cfd1 was built from the previously assigned universal tagset. It must be regenerated from the standard tagset of the treebank.tagged_words() corpus. Insert the following line before the comprehension:

cfd1 = nltk.ConditionalFreqDist(wsj)

The corpus variable wsj was reassigned to the standard tagset just before this example, so this one additional line is all that is required to rebuild the conditional frequency distribution. With the standard tagset, the events 'VBD' and 'VBN' can be found in the distribution (instead of merely 'VERB').

A minor additional detail is that the example result will not be in alphabetical order (as shown in the book text) unless the comprehension is wrapped in the sorted() function.
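To illustrate why the stale cfd1 yields nothing, here is a minimal sketch that needs neither NLTK nor the treebank corpus. It swaps in a plain defaultdict of Counter objects as a stand-in for nltk.ConditionalFreqDist, and the sample word/tag pairs and the helper names build_cfd and both_vbd_and_vbn are made up for the example, not taken from the book.

```python
from collections import Counter, defaultdict

# Hypothetical (word, tag) pairs using Penn Treebank style tags.
# These are illustrative only, not real treebank data.
SAMPLE_FINE = [
    ("asked", "VBD"), ("asked", "VBN"),
    ("added", "VBN"), ("added", "VBD"),
    ("said", "VBD"),
    ("been", "VBN"),
    ("board", "NN"),
]

# The same sample with every verb tag collapsed to the universal tag VERB,
# mimicking what tagset='universal' does to the fine-grained tags.
SAMPLE_UNIVERSAL = [
    (word, "VERB" if tag.startswith("VB") else "NOUN")
    for word, tag in SAMPLE_FINE
]


def build_cfd(tagged_words):
    """Map each word to a Counter of its tags (stand-in for ConditionalFreqDist)."""
    cfd = defaultdict(Counter)
    for word, tag in tagged_words:
        cfd[word][tag] += 1
    return cfd


def both_vbd_and_vbn(cfd):
    """Return, in sorted order, the words seen with both VBD and VBN."""
    return sorted(w for w in cfd if "VBD" in cfd[w] and "VBN" in cfd[w])


# Built from universal tags: 'VBD' and 'VBN' never occur, so nothing matches.
stale = both_vbd_and_vbn(build_cfd(SAMPLE_UNIVERSAL))
# Rebuilt from the fine-grained tags: the ambiguous words are found.
rebuilt = both_vbd_and_vbn(build_cfd(SAMPLE_FINE))

print(stale)
print(rebuilt)
```

With NLTK installed and the treebank corpus downloaded, the equivalent check would presumably be the book's comprehension wrapped in sorted(), run after the cfd1 = nltk.ConditionalFreqDist(wsj) line above.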

pjhinton commented 4 years ago

Another possible approach might be to just use set and its operators to do the work, using the ConditionalFreqDist created and stored in cfd2.

sorted(set(cfd2['VBN']) & set(cfd2['VBD']))
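A small sketch of the same set-intersection idea, again without NLTK: a defaultdict of Counter objects keyed by tag stands in for cfd2, and the sample pairs and the name by_tag are invented for illustration. Because the per-tag distributions are dict-like, their key views can be intersected directly.

```python
from collections import Counter, defaultdict

# Hypothetical (word, tag) pairs; not real treebank data.
sample = [
    ("asked", "VBD"), ("asked", "VBN"),
    ("added", "VBN"), ("added", "VBD"),
    ("said", "VBD"),
    ("been", "VBN"),
]

# Condition on the tag, so each tag maps to a Counter of its words
# (the same shape as a ConditionalFreqDist built from (tag, word) pairs).
by_tag = defaultdict(Counter)
for word, tag in sample:
    by_tag[tag][word] += 1

# Words tagged both VBN and VBD: intersect the two key sets.
both = sorted(set(by_tag["VBN"]) & set(by_tag["VBD"]))

# dict key views support & directly, so the set() calls are optional.
both_via_views = sorted(by_tag["VBN"].keys() & by_tag["VBD"].keys())

print(both)
```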