neomatrix369 / nlp_profiler

A simple NLP library allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.
Other
243 stars 37 forks source link

Changes made to grammar check function #69

Closed bitanb1999 closed 1 year ago

bitanb1999 commented 1 year ago

Please check the options that you have completed and strike-out the options that do not apply via this pull request:

Goal or purpose of the PR

The grammar check function previously used the python language tool, which took significant time to process each text in the textual dataframe and return the output.

Changes implemented in the PR

I analyzed the alternatives available in NLP and came across two options: 1. Happy transformers with the hyperparameter tuning of Gramformer( check: https://github.com/PrithivirajDamodaran/Gramformer) and Gingerit package (check: https://github.com/Azd325/gingerit). Gingerit had a throughput time of 34.8 seconds whereas the language tool from python took 41secs to process each text. This seemed to be a huge upgrade.
Transformers are also a great alternative and did equivalently well but given the constraint of accessing Huggingface every time a text needs to be checked, seems like unnecessary overhead. I have made the changes to the requirement file and to the grammar check python file.

sourcery-ai[bot] commented 1 year ago

Sourcery Code Quality Report

✅  Merging this PR will increase code quality in the affected files by 1.06%.

Quality metrics Before After Change
Complexity 1.94 ⭐ 1.94 ⭐ 0.00
Method Length 35.25 ⭐ 34.50 ⭐ -0.75 👍
Working memory 4.88 ⭐ 4.56 ⭐ -0.32 👍
Quality 89.10% 90.16% 1.06% 👍
Other metrics Before After Change
Lines 37 39 2
Changed files Quality Before Quality After Quality Change
nlp_profiler/high_level_features/grammar_quality_check.py 89.10% ⭐ 90.16% ⭐ 1.06% 👍

Here are some functions in these files that still need a tune-up:

File Function Complexity Length Working Memory Quality Recommendation

Legend and Explanation

The emojis denote the absolute quality of the code:

The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.


Please see our documentation here for details on how these metrics are calculated.

We are actively working on this report - lots more documentation and extra metrics to come!

Help us improve this quality report!

neomatrix369 commented 1 year ago

Well done with @sourcery-ai improvements

neomatrix369 commented 1 year ago

Overall really great work doing the analysis and verifying and checking other alternatives to replace the existing slow grammar checker with a better alternative, also shows we can plug-in and play different tooling with little or not too significant changes .

neomatrix369 commented 1 year ago

Please also do one last check in https://github.com/neomatrix369/nlp_profiler/blob/master/CONTRIBUTING.md to see if any dependent files need changing i.e. re-running notebooks etc, the Developer Guide is also something to review as a closing action.

Maybe you can enhance the existing grammar check example in the notebook(s) to illustrate the new package's features.

There are notebooks on this repo, please take a look at them and re-run them on your local machine to see if your changes have taken effect and no issues have arisen.

There are also markdown files in this repo, they may need a touch-up due to this change - can you pls check if that's the case?

neomatrix369 commented 1 year ago

@bitanb1999 Please read this comment and try to see if you can resolve it https://github.com/neomatrix369/nlp_profiler/pull/69#issuecomment-1463759805 - in either case respond on the PR with your findings

bitanb1999 commented 1 year ago

Please also do one last check in https://github.com/neomatrix369/nlp_profiler/blob/master/CONTRIBUTING.md to see if any dependent files need changing i.e. re-running notebooks etc, the Developer Guide is also something to review as a closing action.

Maybe you can enhance the existing grammar check example in the notebook(s) to illustrate the new package's features.

There are notebooks on this repo, please take a look at them and re-run them on your local machine to see if your changes have taken effect and no issues have arisen.

There are also markdown files in this repo, they may need a touch-up due to this change - can you pls check if that's the case?

I have gone through the contributing.md and I have abided by them all. Also, since the function is being just changed and there is no significant outer change in how the user will be calling the profiler, the notebooks remain as they are and the functions are to be called, as they were being called previously. Hence, no updates are needed in the notebooks.

neomatrix369 commented 1 year ago

If you see code format across the changes is inconsistent - linter/formatter would pick this up

neomatrix369 commented 1 year ago

One last thing to do is update the CHANGELOG.md for this change - its very easy to do, see how the previous ones are done