sloria / TextBlob

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
https://textblob.readthedocs.io/
MIT License

Tokenize Multiple Emojis #345

Open abushoeb opened 3 years ago

abushoeb commented 3 years ago

The current tokenizer does not split consecutive emojis unless they are separated by spaces. For example:

from textblob import TextBlob

sentence = TextBlob("Emoji 😀 is a new way of expressing emotions 🤩😀! #Emoji. ")
sentence.words

returns

WordList(['Emoji', '😀', 'is', 'a', 'new', 'way', 'of', 'expressing', 'emotions', '🤩😀', 'Emoji'])

However, spaCy tokenizes 🤩😀 as two separate tokens. Is this possible in TextBlob? If not, I would be happy to contribute a fix.
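As a possible workaround until the tokenizer handles this, here is a minimal sketch of a post-processing step that splits runs of emojis in an already tokenized list. It is not part of TextBlob: `split_emojis` is a hypothetical helper, and the code point ranges in `_EMOJI` are a rough assumption rather than the full Unicode emoji set. Multi-code-point emojis (ZWJ sequences, skin-tone modifiers, flags) are not handled and may be split apart or left untouched.

```python
import re

# Assumption: a rough approximation of emoji code points, not the full Unicode list.
_EMOJI = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
# The capturing group makes re.split keep the matched emojis in its result.
_EMOJI_RE = re.compile(f"({_EMOJI})")


def split_emojis(tokens):
    """Return a new list in which each emoji is its own token (hypothetical helper)."""
    result = []
    for token in tokens:
        # re.split leaves empty strings between adjacent matches; drop them.
        result.extend(part for part in _EMOJI_RE.split(token) if part)
    return result


words = ["Emoji", "😀", "is", "a", "new", "way", "of", "expressing",
         "emotions", "🤩😀", "Emoji"]
print(split_emojis(words))
```

Applied to the word list from the output above, this should yield '🤩' and '😀' as separate tokens while leaving the other words unchanged. The same function could presumably be called on `list(sentence.words)`, though that has not been verified against TextBlob here.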