neomatrix369 / nlp_profiler

A simple NLP library allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.
Other
243 stars 37 forks source link

[BUG] Not all granular features are getting generated #78

Open neomatrix369 opened 1 year ago

neomatrix369 commented 1 year ago

Describe the bug

After running the notebook(s) on Kaggle/local machine we can see that not all granular features are getting generated for e.g. these fields 'repeated_letters_count', 'repeated_digits_count', 'repeated_spaces_count', 'repeated_whitespaces_count', 'repeated_punctuations_count', 'english_characters_count', 'non_english_characters_count' in addition to the others are not part of the dataframe, either it's not detected or something else is amiss.

To Reproduce

Run the notebook on Kaggle i.e. https://www.kaggle.com/code/neomatrix369/nlp-profiler-simple-dataset and it fails at the cell that looks for repeat characters, etc...

Version information:

NLP Profiler Version 0.0.3 - issue is not relevant to environment or any other technical parameter. The version on the master branch also behaves in the same manner.

Additional context

From the logs on https://www.kaggle.com/code/neomatrix369/nlp-profiler-simple-dataset#Installation-and-import-libraries/packages - the 0.0.3 version on PyPi worked in the past and for some time has not been working.