trutzig89182 closed 2 years ago
I agree! I will do that later today. Idea: add a parameter that indicates whether we are dealing with tweets or normal texts, so that we can switch between Tokenizer() and TweetTokenizer().
I have just implemented a simple tokenizer that can also tokenize tweets if needed. The default is the normal nltk.tokenize.word_tokenize. It is important to install the required nltk packages first (see README.md).
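For readers following along, here is a minimal sketch of what such a switch could look like. The function name `tokenize` and the `tweets` parameter are made up for illustration and are not necessarily what the repository uses. `word_tokenize` and `TweetTokenizer` are the real NLTK names; `word_tokenize` additionally needs the NLTK tokenizer data to be downloaded, while `TweetTokenizer` works without it.

```python
from typing import List

from nltk.tokenize import TweetTokenizer, word_tokenize


def tokenize(document: str, tweets: bool = False) -> List[str]:
    """Split a document into tokens.

    `tokenize` and `tweets` are hypothetical names used for this sketch.
    """
    if tweets:
        # TweetTokenizer keeps @handles, #hashtags and emoticons as single tokens.
        return TweetTokenizer().tokenize(document)
    # Default path: standard NLTK word tokenizer (requires the NLTK
    # tokenizer data to be installed first).
    return word_tokenize(document)


if __name__ == "__main__":
    print(tokenize("@user this is #great :-)", tweets=True))
```

Keeping the plain tokenizer as the default means existing callers do not have to change, and tweet handling stays opt-in.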
Seems to be okay for the moment.
While writing the unit tests I saw that punctuation marks are treated as part of the words in our first, very rough word separation with word_list = document.split(" ").
So implementing a tokenizer is an important next step. You already mentioned the NLTK tokenizer in the script.
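To illustrate the problem, a small sketch with a made-up example sentence. The regular expression in the second half is only a dependency-free stand-in for a real tokenizer such as NLTK's, not what the project uses.

```python
import re

document = "Hello, world! This is a test."

# Naive split on spaces: punctuation stays attached to the words.
word_list = document.split(" ")
print(word_list)  # ['Hello,', 'world!', 'This', 'is', 'a', 'test.']

# Stand-in tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", document)
print(tokens)  # ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
```

With the naive split, 'Hello,' and 'Hello' would count as two different words, which is why the unit tests caught it.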