Open Iota87 opened 4 years ago
Great comments Henri, and good catches on the typos. I added tokenize, references and adjusted the structure in line with your input. Let me know what you think. I am a bit hesitant to add "remove_html_tags" here because I do not know if it is something that you can easily explain in plain words and in a succinct way to a complete beginner. It can be explained in a separate section/tutorial, but I am not sure you want to get into HTML tags in the getting started. What do you think?
Hi Guys!
Thank you Giovanni for the great start and Henri for the comments!
Sorry for the late review!
As a general comment, I think we need to make it more technical and concise. The end goal of the getting-started preprocessing tutorial is to teach how to use Texthero to actually do text preprocessing.
Since we want to guide users through Texthero's preprocessing core, it's important to show them concretely how to carry out each step.
Giovanni, do you think you can start from the comment below, test the code in a Jupyter Notebook, and then write a getting-started tutorial around it? I didn't go into the details, to give you more freedom; if you want more advice or something is unclear, just let me know!
Kind regards, Jonathan
(overview + what's important to keep in mind)
clean function: here we want to offer something more and explain how to clean some text data. It's important to give users examples as well as guide them through the process.
Introduction to this new "chapter": mention what we have seen before + an introductory sentence about preprocessing ... something like: "By now you should have a general overview of what Texthero is about; in the next sections we will dig a bit deeper into Texthero's core and see what we can get out of our beautiful text data."
Link + introduction
Mention that there is the standard clean function, and that the pipeline can also be customized.
Mention chaining: all preprocessing functions receive a Pandas Series as input and return a Pandas Series. This allows chaining multiple functions in a pandas-pythonic fashion.
FAQ questions, mostly to improve SEO.
Preprocessing is about data cleaning. Let's assume we have some dirty data that we want to clean; in particular, we want to keep only the relevant, clean content.
df = pd.DataFrame(["I have the power! $$ (wow!)", "Flame on!", "HULK SMASH!", ..., "Holy ____ Batman!", "I am the vengeance, I am the night, I am BATMAN!", "I am GROOT.", "I’m going ghost!", "I am the law!", "SPOOOON!!!"], columns=["text"])
Let's start by calling clean ... see what happens.
hero.preprocessing.clean(df['text'])
...
comment ...
Now, assume we want to keep the punctuation marks but remove parentheses ... open the "preprocessing API" page and look for "remove_brackets".
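To make the idea concrete before pointing readers to the API page, here is a plain-Python sketch of what "remove brackets but keep punctuation" could look like using only the standard `re` module. It is not Texthero's implementation: `strip_bracketed` is a hypothetical helper, and it assumes that removing brackets means dropping each bracket pair together with its content. The actual behavior of "remove_brackets" should be checked on the preprocessing API page.

```python
import re

# Assumption: a bracketed span is a non-nested pair of (), [], {} or <>,
# and the whole span (brackets plus content) is removed.
BRACKETED = re.compile(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}|<[^<>]*>")


def strip_bracketed(text: str) -> str:
    """Hypothetical helper: drop bracketed spans, keep all other punctuation."""
    without_brackets = BRACKETED.sub("", text)
    # Removing a span can leave doubled or trailing spaces behind; tidy them up.
    return re.sub(r"\s+", " ", without_brackets).strip()


print(strip_bracketed("I have the power! $$ (wow!)"))
```

Note that the exclamation mark and the dollar signs survive; only the parenthesized part is gone. This could also serve as a bridge to the regex tutorial link.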
Show a custom pipeline and explain it:
df['clean'] = (
    df['text']
    .pipe(p.function1)
    .pipe(p.function2)
    .pipe(p.function3)
)
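A minimal, runnable sketch of the Series-in, Series-out pattern behind such a pipeline. The three functions below are hypothetical stand-ins written with plain pandas string methods, not functions from Texthero; in the tutorial they would be replaced by the ones picked from the preprocessing API page. The only real API relied on here is pandas' `Series.pipe` and the `Series.str` accessor.

```python
import pandas as pd


def lowercase(s: pd.Series) -> pd.Series:
    """Hypothetical stand-in: lowercase every string in the Series."""
    return s.str.lower()


def drop_parenthesized(s: pd.Series) -> pd.Series:
    """Hypothetical stand-in: remove '(...)' together with its content."""
    return s.str.replace(r"\([^()]*\)", "", regex=True)


def collapse_whitespace(s: pd.Series) -> pd.Series:
    """Hypothetical stand-in: squeeze runs of whitespace and trim the ends."""
    return s.str.replace(r"\s+", " ", regex=True).str.strip()


df = pd.DataFrame(
    {"text": ["I have the power! $$ (wow!)", "Flame on!", "HULK SMASH!"]}
)

# Each step takes a Series and returns a Series, so the steps chain with .pipe.
df["clean"] = (
    df["text"]
    .pipe(lowercase)
    .pipe(drop_parenthesized)
    .pipe(collapse_whitespace)
)

print(df["clean"].tolist())
```

Because every step has the same shape, steps can be reordered, dropped, or swapped without touching the rest of the chain, which is the point worth explaining to the reader.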
two-three high-quality links to other pages about text-preprocessing + a getting started tutorial on regex with python
Sounds good, Jonathan! I reviewed your comments and suggestions; they are perfectly aligned with what we discussed on the call. Working on it! Thanks, Giovanni
To be completed. Need preliminary feedback on: