Maybe it would be good to collect several smaller TODOs here:
[ ] NLTK's word_tokenize also returns punctuation marks as tokens. I am not sure if we want that. For the moment, I have added a simple list comprehension that keeps only tokens matching \w+, but we might need to think of better solutions here (or, if we stick to this, we can also use NLTK's RegexpTokenizer).
[ ] Add functions to work directly with Twitter data from jsonl files (low priority)
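
To make the tokenizer item concrete, here is a rough sketch of the two options side by side. The helper names (keep_word_tokens, tokenize_words) and the sample token list are made up for illustration and are not taken from our code; I am also assuming the current filter means "the whole token matches \w+". The import is guarded only so the snippet runs without NLTK installed.

```python
import re

try:
    from nltk.tokenize import RegexpTokenizer
except ImportError:  # keep the sketch runnable without NLTK
    RegexpTokenizer = None

WORD = re.compile(r"\w+")


def keep_word_tokens(tokens):
    """Option 1: filter an existing token list (hypothetical helper).

    Keeps only tokens that consist entirely of word characters.
    """
    return [tok for tok in tokens if WORD.fullmatch(tok)]


def tokenize_words(text):
    """Option 2: tokenize on \\w+ directly (hypothetical helper)."""
    if RegexpTokenizer is not None:
        return RegexpTokenizer(r"\w+").tokenize(text)
    # Fallback with the same pattern when NLTK is unavailable
    return WORD.findall(text)


# Hand-written tokens in roughly the shape a word tokenizer returns;
# this is sample input, not captured output.
tokens = ["Hello", ",", "world", "!", "It", "'s", "2020", "."]

filtered = keep_word_tokens(tokens)
words = tokenize_words("Hello, world! It's 2020.")
```

One difference to keep in mind when choosing: filtering drops a clitic token such as "'s" completely, while tokenizing on \w+ splits at the apostrophe and keeps a bare "s".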