jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License

punctuation not being removed correctly using `preprocessing.clean` #207

Open aliforgetti opened 3 years ago

aliforgetti commented 3 years ago

This is the code I am using to clean a large dataset:

full_data['text_pp'] = (
    full_data['text']
    .pipe(hero.preprocessing.clean)
    .pipe(hero.remove_urls)
)

According to the documentation this is the default pipeline for the clean functionality:

Default pipeline:
texthero.preprocessing.fillna()
texthero.preprocessing.lowercase()
texthero.preprocessing.remove_digits()
texthero.preprocessing.remove_punctuation()
texthero.preprocessing.remove_diacritics()
texthero.preprocessing.remove_stopwords()
texthero.preprocessing.remove_whitespace()

But my output does not reflect this: some of the punctuation remained in the text.

Original text column image

Preprocessed text column image

henrifroese commented 3 years ago

Hi, could you paste the actual data you're using? (Just one of the texts would help probably).

For me with the beginning of your first text, the punctuation is removed successfully:

>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series(["Honestly people don't know about the fact ..."])
>>> hero.clean(s)
0    honestly people know fact
dtype: object

The issue is probably that some punctuation in your text is not "standard" punctuation: texthero internally uses `string.punctuation` (from `import string`), so any character that is not in that set won't be removed.
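To check whether this is what is happening, something like the following might help. It is only a sketch using the standard library: the helper names `non_ascii_punctuation` and `strip_unicode_punctuation` and the sample text are made up for illustration and are not part of texthero. The first helper lists punctuation characters that `string.punctuation` does not cover (curly quotes, ellipsis characters and so on); the second replaces every character in a Unicode punctuation category with a space.

```python
import string
import unicodedata


def non_ascii_punctuation(text: str) -> set:
    """Return punctuation characters in `text` that string.punctuation lacks."""
    return {
        ch
        for ch in text
        if unicodedata.category(ch).startswith("P")
        and ch not in string.punctuation
    }


def strip_unicode_punctuation(text: str) -> str:
    """Replace every character in a Unicode punctuation category (P*) with a space."""
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )


# Hypothetical sample with "typographic" punctuation, as often found in scraped text.
sample = "Honestly people don’t know about the fact … “really”"

print(non_ascii_punctuation(sample))      # characters the default step would leave behind
print(strip_unicode_punctuation(sample))  # text with those characters replaced by spaces
```

If the first call returns a non-empty set for one of your texts, that would explain the leftover characters. Note that the second helper only targets Unicode category `P*`; ASCII symbols such as `$` or `+` are in `string.punctuation` but are classified as symbols, so it is meant to run in addition to the default cleaning, not instead of it. It could be applied with `full_data['text'].apply(strip_unicode_punctuation)` before calling `hero.clean`.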

jbesomi commented 3 years ago

Thank you @henrifroese. @aliforgetti do you have any updates?