bitanb1999 commented 1 year ago

Please check the options that you have completed and strike out the options that do not apply via this pull request:

[X] a clear title and description of the Pull Request has been provided
you have read
[X] the Contributing doc
[X] the Developer Guide
[X] the pull request passes the tests (`./test-coverage "tests slow-tests"`) - this will also be visible via the Code coverage report and CI/CD task on the Pull Request
[X] you have performed some kind of smoke test by running your changes in an isolated environment i.e. Docker container, Google Colab, Kaggle, etc...
[] ~~the notebooks are updated (see notebooks folder, read the Notebooks docs)~~
[X] CHANGELOG.md has been updated (please follow the existing format)

Goal or purpose of the PR

The spelling checker previously used TextBlob and required tokenization for both spell checking and the spelling quality summarisation. This took significant time, and the resulting score was also unsatisfactory.

Changes implemented in the PR

I replaced the checker function with Symspellpy, a package that claims to be much faster than TextBlob and jamspell. Further, the previous scoring was based entirely on the ratio of the number of misspelled words to the total length of the string, which doesn't take ease of reading or "whether the phrase makes sense" into account. To resolve these issues, I used fuzzy-matching techniques that compare the original text with the corrected text and score the text accordingly.
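As a rough illustration of the scoring idea described above (not the code in this PR), the sketch below compares a text with its corrected form and uses the similarity as the score. It makes several assumptions: `difflib.SequenceMatcher` from the standard library stands in for whichever fuzzy-matching package the PR uses, and `spelling_quality_score` and `toy_correct` are hypothetical names, with `toy_correct` standing in for a real corrector such as Symspellpy.

```python
from difflib import SequenceMatcher
from typing import Callable


def spelling_quality_score(text: str, correct: Callable[[str], str]) -> float:
    """Return a 0..1 similarity between `text` and its corrected form.

    Hypothetical helper: `correct` is any function that returns the
    spell-corrected version of the text it is given.
    """
    corrected = correct(text)
    # SequenceMatcher is a stdlib stand-in for a fuzzy-matching ratio.
    return SequenceMatcher(None, text.lower(), corrected.lower()).ratio()


def toy_correct(text: str) -> str:
    """Toy corrector backed by a tiny lookup table (illustration only)."""
    fixes = {"teh": "the", "recieve": "receive"}
    return " ".join(fixes.get(word, word) for word in text.split())


if __name__ == "__main__":
    # Text that needs no correction is identical to its corrected form.
    print(spelling_quality_score("the cat", toy_correct))
    # Each misspelling lowers the similarity below 1.0.
    print(spelling_quality_score("teh cat will recieve it", toy_correct))
```

Because the score is derived from how far the corrected text drifts from the original, every correction lowers it, which is the behaviour the paragraph above describes.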

sourcery-ai[bot] commented 1 year ago

Sourcery Code Quality Report

❌ Merging this PR will decrease code quality in the affected files by 2.06%.

| Quality metrics | Before | After | Change |
| --- | --- | --- | --- |
| Complexity | 3.06 ⭐ | 3.51 ⭐ | 0.45 👎 |
| Method Length | 38.27 ⭐ | 43.00 ⭐ | 4.73 👎 |
| Working memory | 4.81 ⭐ | 5.10 ⭐ | 0.29 👎 |
| Quality | 86.71% ⭐ | 84.65% ⭐ | -2.06% 👎 |

| Other metrics | Before | After | Change |
| --- | --- | --- | --- |
| Lines | 137 | 154 | 17 |

| Changed files | Quality Before | Quality After | Quality Change |
| --- | --- | --- | --- |
| nlp_profiler/high_level_features/ease_of_reading_check.py | 85.73% ⭐ | 85.18% ⭐ | -0.55% 👎 |
| nlp_profiler/high_level_features/spelling_quality_check.py | 87.36% ⭐ | 84.28% ⭐ | -3.08% 👎 |

Here are some functions in these files that still need a tune-up:

File	Function	Complexity	Length	Working Memory	Quality	Recommendation

Legend and Explanation

The emojis denote the absolute quality of the code:

⭐ excellent
🙂 good
😞 poor
⛔ very poor

The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.

Please see our documentation here for details on how these metrics are calculated.

We are actively working on this report - lots more documentation and extra metrics to come!


neomatrix369 commented 1 year ago

This PR depends on PR #69 being merged. Once that PR is merged, this one can proceed; until then, let's resolve any comments on this PR.

neomatrix369 commented 1 year ago

Please also do one last check in https://github.com/neomatrix369/nlp_profiler/blob/master/CONTRIBUTING.md to see if any dependent files need changing, e.g. re-running notebooks. The Developer Guide is also something to review as a closing action.

Maybe you can enhance the existing grammar check example in the notebook(s) to illustrate the new package's features.

There are notebooks on this repo, please take a look at them and re-run them on your local machine to see if your changes have taken effect and no issues have arisen.

There are also markdown files in this repo that may need a touch-up due to this change. Can you please check if that's the case?

neomatrix369 commented 1 year ago

This PR is related to https://github.com/neomatrix369/nlp_profiler/issues/8. Have a good read of the issue to see if all or most of the requirements there are resolved by this PR.

bitanb1999 commented 1 year ago

> This PR is related to #8, have a good read of the issue to see if all or most of the requirements there are resolved by this PR

I checked #8 and #2, and this PR addresses both issues. The scores are now computed with a fuzzy-matching algorithm that penalizes each misspelled word and the arrangement of tokens. See this article: https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe Also, multiple articles state that the Symspell package is much faster than TextBlob, so #2 is addressed as well.

neomatrix369 commented 1 year ago

One last thing to do is update the CHANGELOG.md for this change. It's very easy to do; see how the previous entries are done.

neomatrix369 / nlp_profiler

Spelling checker has been modified #71
