neomatrix369 / nlp_profiler

A simple NLP library allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.

Spelling checker has been modified #71

Closed bitanb1999 closed 1 year ago

bitanb1999 commented 1 year ago

Please check the options that you have completed and strike out the options that do not apply via this pull request:

Goal or purpose of the PR

The spelling checker previously used TextBlob and required tokenization for both spell checking and the spelling-quality summary. This was slow, and the resulting score was also unsatisfactory.

Changes implemented in the PR

I replaced the checker function with symspellpy, a package that claims to be much faster than TextBlob and JamSpell. Furthermore, the previous score was based entirely on the ratio of the number of misspelled words to the total length of the string, which doesn't take ease of reading or "whether the phrase makes sense" into account. To resolve these issues, I used fuzzy-matching techniques that compare the original text with the corrected text and score the text accordingly.
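For readers following along, the scoring idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it uses only the standard library, with `difflib.get_close_matches` standing in for the symspellpy lookup and `difflib.SequenceMatcher` standing in for the fuzzy-matching package. The vocabulary and the function names `correct_text` and `fuzzy_spelling_score` are hypothetical.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical toy vocabulary; a real checker would load a frequency dictionary.
VOCABULARY = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]


def correct_text(text: str) -> str:
    """Replace each unknown word with its closest vocabulary entry, if any.

    Stand-in for a real spelling corrector such as symspellpy.
    """
    corrected = []
    for word in text.lower().split():
        if word in VOCABULARY:
            corrected.append(word)
            continue
        # Pick the single best candidate above a similarity cutoff.
        matches = get_close_matches(word, VOCABULARY, n=1, cutoff=0.6)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


def fuzzy_spelling_score(text: str) -> float:
    """Score in [0, 1]: similarity between the text and its corrected form.

    A text that needs no correction scores 1.0; the more the corrector
    has to change, the lower the score.
    """
    original = " ".join(text.lower().split())
    return SequenceMatcher(None, original, correct_text(original)).ratio()


if __name__ == "__main__":
    for sample in ["the quick brown fox", "the quik brown fox"]:
        print(sample, "->", correct_text(sample), fuzzy_spelling_score(sample))
```

Because the score compares whole strings rather than counting misspelled words, a small typo in a long sentence costs less than the same typo in a short one.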

sourcery-ai[bot] commented 1 year ago

Sourcery Code Quality Report

❌  Merging this PR will decrease code quality in the affected files by 2.06%.

| Quality metrics | Before | After | Change |
| --- | --- | --- | --- |
| Complexity | 3.06 ⭐ | 3.51 ⭐ | 0.45 👎 |
| Method Length | 38.27 ⭐ | 43.00 ⭐ | 4.73 👎 |
| Working memory | 4.81 ⭐ | 5.10 ⭐ | 0.29 👎 |
| Quality | 86.71% ⭐ | 84.65% ⭐ | -2.06% 👎 |

| Other metrics | Before | After | Change |
| --- | --- | --- | --- |
| Lines | 137 | 154 | 17 |

| Changed files | Quality Before | Quality After | Quality Change |
| --- | --- | --- | --- |
| nlp_profiler/high_level_features/ease_of_reading_check.py | 85.73% ⭐ | 85.18% ⭐ | -0.55% 👎 |
| nlp_profiler/high_level_features/spelling_quality_check.py | 87.36% ⭐ | 84.28% ⭐ | -3.08% 👎 |

Here are some functions in these files that still need a tune-up:

File Function Complexity Length Working Memory Quality Recommendation

Legend and Explanation

The emojis denote the absolute quality of the code:

The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.


neomatrix369 commented 1 year ago

This PR depends on PR #69 being merged. Once that PR is merged, this one can proceed; until then, let's resolve any comments on this PR.

neomatrix369 commented 1 year ago

Please also do one last check of https://github.com/neomatrix369/nlp_profiler/blob/master/CONTRIBUTING.md to see if any dependent files need changing, e.g. notebooks that need re-running. The Developer Guide is also worth reviewing as a closing action.

Maybe you can enhance the existing grammar check example in the notebook(s) to illustrate the new package's features.

There are notebooks in this repo; please take a look at them and re-run them on your local machine to check that your changes have taken effect and that no issues have arisen.

There are also markdown files in this repo that may need a touch-up because of this change. Can you please check whether that's the case?

neomatrix369 commented 1 year ago

This PR is related to https://github.com/neomatrix369/nlp_profiler/issues/8. Have a good read of the issue to see if all or most of the requirements there are resolved by this PR.

bitanb1999 commented 1 year ago

> This PR is related to #8, have a good read of the issue to see if all or most of the requirements there are resolved by this PR

I checked #8 and #2, and this PR addresses both issues. The scores are now computed with a fuzzy-matching algorithm that penalizes each misspelled word as well as the arrangement of tokens. See this article: https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe Also, multiple articles state that the SymSpell package is much faster than TextBlob, so #2 is addressed as well.
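To illustrate what "penalizing the arrangement of tokens" can mean in practice, here is a small sketch that contrasts an order-sensitive similarity with an order-insensitive one. The thread does not say which comparison function the PR uses, so this is only an assumption-laden example: `difflib.SequenceMatcher` from the standard library stands in for the fuzzy-matching package, and `simple_ratio` and `token_sort_ratio` are illustrative helper names defined here.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> float:
    """Order-sensitive similarity: reordered tokens lower the score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_sort_ratio(a: str, b: str) -> float:
    """Order-insensitive similarity: tokens are sorted before comparing."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, sorted_a, sorted_b).ratio()


if __name__ == "__main__":
    original = "brown fox quick"
    corrected = "quick brown fox"
    # The same words in a different order: only the first score drops.
    print("simple:", simple_ratio(original, corrected))
    print("token sort:", token_sort_ratio(original, corrected))
```

A score meant to penalize token arrangement would lean on the order-sensitive comparison, while the sorted variant isolates pure spelling differences.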

neomatrix369 commented 1 year ago

One last thing to do is update CHANGELOG.md for this change. It's very easy to do; see how the previous entries are done.