Currently, wordview.text_analysis.core.do_txt_analysis extracts tokens by splitting the text on spaces. Improve this by using a proper tokenizer, e.g. the NLTK word tokenizer.
Solution:
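from nltk.tokenize import sent_tokenize, word_tokenize
from tqdm import tqdm

num_tokens = 0
for text in tqdm(df["review"]):
    try:
        # Split into sentences first, then tokenize each sentence.
        sentences = sent_tokenize(text.lower())
        for sentence in sentences:
            sentence_tokens = word_tokenize(sentence)
            num_tokens += len(sentence_tokens)
    except Exception as e:
        # Skip entries that cannot be tokenized (e.g. non-string values).
        print("Processing entry --- %s --- led to exception: %s" % (text, e))
        continue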
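To illustrate why splitting on spaces is a problem, here is a minimal sketch that compares it with a tokenizer that separates punctuation. The regex tokenizer is only a hypothetical stand-in so the example runs without NLTK or its data files; it is not NLTK's word_tokenize and will differ from it on many inputs. The function names and the sample sentence are assumptions made for illustration.

```python
import re

# Hypothetical stand-in for a real tokenizer: a word (optionally with an
# internal apostrophe, e.g. "don't") or a single punctuation character.
_TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def whitespace_tokens(text):
    """Current behaviour described in the issue: split on whitespace."""
    return text.lower().split()


def regex_tokens(text):
    """Stand-in tokenizer that separates punctuation from words."""
    return _TOKEN_RE.findall(text.lower())


sample = "Great movie, really great!"
print(whitespace_tokens(sample))  # punctuation stays glued to the words
print(regex_tokens(sample))       # punctuation becomes separate tokens
```

With whitespace splitting, "great" and "great!" are counted as different tokens, which skews token and vocabulary statistics. A tokenizer avoids this by emitting punctuation separately.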