Tokenization incorrectly splits "gonna" into "gon" and "na"

sloria / TextBlob

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

https://textblob.readthedocs.io/

MIT License


Open whosken opened 9 years ago

whosken commented 9 years ago

Verified that this occurs in 0.10.0 :sob:

>>> import textblob
>>> textblob.TextBlob('gonna do this').words
WordList(['gon', 'na', 'do', 'this'])

ghost commented 7 years ago

@whosken this is the standard NLTK (Penn Treebank) tokenization, which splits "gonna" into "gon" and "na" by design. You might wanna use NLTK directly for other tokenizer options.
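
For anyone landing here, a rough sketch of what a tokenizer without the Treebank contraction rules could look like. This uses only the standard library; `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names for illustration, not part of TextBlob or NLTK. The pattern is the one NLTK's `WordPunctTokenizer` is built on, so with NLTK installed that class should behave the same way.

```python
import re

# Hypothetical helper, not a TextBlob or NLTK API.
# Matches either a run of word characters or a run of punctuation,
# so no contraction-specific splitting rules are applied.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_tokenize(text):
    """Split text into word and punctuation tokens, leaving 'gonna' intact."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("gonna do this"))  # ['gonna', 'do', 'this']
```

The trade-off is that standard contractions get split on the apostrophe instead ("don't" becomes "don", "'", "t"), so this only helps if informal forms like "gonna" matter more than those. TextBlob also accepts a `tokenizer=` argument in its constructor; as far as I know the custom tokenizer's output shows up under `.tokens` rather than `.words`.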