sloria / TextBlob

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
https://textblob.readthedocs.io/
MIT License

Tokenization incorrectly splits "gonna" into "gon" and "na" #100

Open whosken opened 9 years ago

whosken commented 9 years ago

Verified that this occurs in 0.10.0 :sob:

>>> import textblob
>>> textblob.TextBlob('gonna do this').words
WordList(['gon', 'na', 'do', 'this'])

ghost commented 7 years ago

@whosken this is the standard NLTK (TreeBank) tokenization. You might wanna use NLTK directly for other options.
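
To make the behavior above concrete, here is a minimal stdlib-only sketch. `GONNA_RULE` is a simplified stand-in for a Treebank-style contraction rule, and `treebank_like_tokenize` and `simple_word_tokenize` are hypothetical helper names; none of this is NLTK's or TextBlob's actual code.

```python
import re

# Simplified stand-in for a Treebank-style contraction rule
# (an assumption for illustration, not NLTK's real implementation).
GONNA_RULE = re.compile(r"\b(gon)(na)\b", re.IGNORECASE)


def treebank_like_tokenize(text):
    """Separate 'gonna' into 'gon' + 'na', then split on whitespace."""
    return GONNA_RULE.sub(r"\1 \2", text).split()


def simple_word_tokenize(text):
    """Hypothetical alternative: keep runs of letters, digits and apostrophes whole."""
    return re.findall(r"[A-Za-z0-9']+", text)


treebank_tokens = treebank_like_tokenize("gonna do this")
simple_tokens = simple_word_tokenize("gonna do this")
print(treebank_tokens)  # ['gon', 'na', 'do', 'this']
print(simple_tokens)    # ['gonna', 'do', 'this']
```

If NLTK is installed, a tokenizer such as `nltk.tokenize.WhitespaceTokenizer` plays roughly the same role as `simple_word_tokenize` here, at the cost of leaving punctuation attached to words.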