meghdadFar / wordview

A Python package for Exploratory Data Analysis (EDA) for text-based data.
MIT License

To count tokens, use a word tokenizer in `wordview.text_analysis.core.do_txt_analysis` #144

Closed meghdadFar closed 6 months ago

meghdadFar commented 6 months ago

Description

Currently, `wordview.text_analysis.core.do_txt_analysis` extracts tokens by splitting the text on spaces, which leaves punctuation attached to neighboring words and skews the token count. Improve this by using a proper word tokenizer, e.g. NLTK's `word_tokenize`.
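To illustrate why the count changes, here is a minimal sketch comparing whitespace splitting with a tokenizer that separates punctuation. The regex tokenizer is only a stand-in so the example runs without downloading NLTK data; it is not NLTK's `word_tokenize` and not the code in wordview, and the function names are hypothetical.

```python
import re

# Hypothetical stand-in for a real word tokenizer:
# matches runs of word characters, or single punctuation marks.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def count_tokens_whitespace(text: str) -> int:
    """Count tokens by splitting on whitespace (the behavior described above)."""
    return len(text.split())


def count_tokens_regex(text: str) -> int:
    """Count tokens with punctuation treated as separate tokens."""
    return len(_TOKEN_RE.findall(text))


sample = "Great movie, loved it!"
# Whitespace splitting yields "movie," and "it!" as single tokens.
whitespace_count = count_tokens_whitespace(sample)
# The regex tokenizer splits "," and "!" into their own tokens.
regex_count = count_tokens_regex(sample)
print(whitespace_count, regex_count)
```

A real tokenizer such as NLTK's also handles contractions and other edge cases that this regex does not, so the exact counts can differ from the stand-in.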

Solution:

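from nltk.tokenize import sent_tokenize, word_tokenize
from tqdm import tqdm

num_tokens = 0  # running token count across all texts
for text in tqdm(df["review"]):  # df is the DataFrame being analyzed
    try:
        # Split into sentences first, then tokenize each sentence into words.
        sentences = sent_tokenize(text.lower())
        for sentence in sentences:
            sentence_tokens = word_tokenize(sentence)
            num_tokens += len(sentence_tokens)
    except Exception as e:
        # Report and skip entries that cannot be tokenized.
        print("Processing entry --- %s --- led to exception: %s" % (text, e))
        continue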

meghdadFar commented 6 months ago

Resolved by PR #145