Open abushoeb opened 3 years ago
The current tokenizer does not split consecutive emojis when they aren't space-separated. For example:
from textblob import TextBlob
sentence = TextBlob("Emoji 😀 is a new way of expressing emotions 🤩😀! #Emoji. ")
sentence.words
returns
WordList(['Emoji', '😀', 'is', 'a', 'new', 'way', 'of', 'expressing', 'emotions', '🤩😀', 'Emoji'])
However, spaCy is able to tokenize 🤩😀 as two separate tokens. I'm wondering whether this is possible in TextBlob. If not, I would be happy to contribute.
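
In the meantime, one possible workaround is to post-process the word list and split any token that contains emojis. This is only a minimal sketch: `split_emojis` and `EMOJI_SPLIT_RE` are hypothetical names that are not part of TextBlob, and the code-point ranges are a rough approximation that does not handle ZWJ sequences, skin-tone modifiers, or flags.

```python
import re

# Assumption: these ranges cover the most common emoji blocks only.
# The capturing group makes re.split keep each matched emoji in its output.
EMOJI_SPLIT_RE = re.compile("([\U0001F300-\U0001FAFF\u2600-\u27BF])")


def split_emojis(tokens):
    """Return a new list in which every emoji is its own token."""
    result = []
    for token in tokens:
        parts = EMOJI_SPLIT_RE.split(token)
        # re.split yields empty strings between adjacent matches; drop them.
        result.extend(part for part in parts if part)
    return result


# The word list reported above, used here as plain strings.
words = ['Emoji', '😀', 'is', 'a', 'new', 'way', 'of', 'expressing',
         'emotions', '🤩😀', 'Emoji']
print(split_emojis(words))
```

The same function could be applied to `list(sentence.words)`. A proper fix would probably belong in the tokenizer itself rather than in a post-processing step.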