neomatrix369 / nlp_profiler

A simple NLP library that allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.

Improve logic behind spell checking text #8

Closed neomatrix369 closed 1 year ago

neomatrix369 commented 4 years ago

Meaning: evaluate, in a fair fashion, how bad the spelling in the text is on the whole.

At the moment it uses the logic below:

# NaN, get_tokenized_text, actual_spell_check and get_sentence_count
# are helpers defined elsewhere in the module
def spelling_quality_score(text: str) -> float:
    # non-string or empty input: nothing to score
    if (not isinstance(text, str)) or (len(text.strip()) == 0):
        return NaN

    tokenized_text = get_tokenized_text(text)
    # collect every token the spell checker flags as misspelt
    misspelt_words = [
        each_word for each_word in tokenized_text
        if actual_spell_check(each_word) is not None
    ]
    avg_words_per_sentence = \
        len(tokenized_text) / get_sentence_count(text)
    # score drops as misspelt words approach the average sentence length
    result = 1 - (len(misspelt_words) / avg_words_per_sentence)

    # clamp negative scores to 0.0
    return result if result >= 0.0 else 0.0

This can be improved, as there are visible chances of false-positive or false-negative scores.
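For example, with hypothetical numbers, dividing by the average words per sentence rather than the total word count can make a lightly misspelt text score far worse than it should:

tokens = 20           # total words in the text
sentences = 4         # so avg_words_per_sentence = 5.0
misspelt = 3          # words flagged by the spell checker

avg_words_per_sentence = tokens / sentences      # 5.0
score = 1 - (misspelt / avg_words_per_sentence)  # 1 - 0.6 = 0.4

# only 15% of the words (3 of 20) are misspelt, yet the score is 0.4,
# which the scoring table would report as quite poor spelling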

PS: performance of this feature is being addressed in #2, so this particular issue isn't about improving its speed/performance. Performance issues may be addressed via other issues at a later stage. There have already been some significant performance improvements to the spell check and other aspects of NLP Profiler via #2.

The fix to #14 impacts this issue; the two will need to be fixed together.


~Replace the spellchecker with the pyspellchecker package (on PyPI), which appears to be closer to Peter Norvig's work.~ Replaced with symspellpy (https://pypi.org/project/symspellpy/).
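For reference, a minimal sketch of flagging a misspelt word with symspellpy, loading the English frequency dictionary bundled with the package:

import pkg_resources
from symspellpy import SymSpell, Verbosity

# build the SymSpell index from the bundled frequency dictionary
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# lookup() returns suggestions within the edit distance; a suggestion
# at distance 0 means the word is already in the dictionary
for word in ["speling", "example"]:
    best = sym_spell.lookup(word, Verbosity.CLOSEST, max_edit_distance=2)[0]
    print(word, "->", best.term, "(distance", str(best.distance) + ")")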

neomatrix369 commented 4 years ago

A simpler solution would be to revert to the original logic:

score = 1 - (number_of_incorrect_words / number_of_correct_words)

and adjust the Words of Estimative Probability table to a stricter scoring:

   ["Very good", 99, 100],  
    ["Quite good", 95, 99], 
    ["Good", 90, 95],  
    ["Pretty good", 85, 90], 
    ["Bad", 60, 85],  
    ["Pretty bad", 12, 60],  
    ["Quite bad", 2, 12],  
    ["Very bad", 0, 2]  

We can tune this logic further with new input from users in the community. Eventually, this table could be made customisable or passed in as a parameter to assist in the scoring.
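A minimal sketch of the reverted formula combined with the stricter bands above (the function name and inputs here are hypothetical, standing in for the real spell-check counts):

# stricter Words of Estimative Probability bands: [label, lower %, upper %]
SPELLING_QUALITY_BANDS = [
    ["Very good", 99, 100],
    ["Quite good", 95, 99],
    ["Good", 90, 95],
    ["Pretty good", 85, 90],
    ["Bad", 60, 85],
    ["Pretty bad", 12, 60],
    ["Quite bad", 2, 12],
    ["Very bad", 0, 2],
]

def spelling_quality(number_of_incorrect_words: int,
                     number_of_correct_words: int) -> str:
    # reverted formula; assumes at least one correctly spelt word
    score = 1 - (number_of_incorrect_words / number_of_correct_words)
    percentage = max(score, 0.0) * 100
    # first matching band wins, so boundary values map to the better label
    for label, lower, upper in SPELLING_QUALITY_BANDS:
        if lower <= percentage <= upper:
            return label
    return "Very bad"

print(spelling_quality(1, 50))   # 98% -> "Quite good"
print(spelling_quality(10, 40))  # 75% -> "Bad"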

neomatrix369 commented 4 years ago

The new logic can be found in https://github.com/neomatrix369/nlp_profiler/blob/master/nlp_profiler/spelling_quality_check.py#L59 and the changes are as per the comment https://github.com/neomatrix369/nlp_profiler/issues/8#issuecomment-704932155.

It may not be the best or optimal fix, but it's a simple fix to start with.

neomatrix369 commented 4 years ago

The issue is partially fixed via #16.