Iota87 commented 4 years ago

To be completed. I need preliminary feedback on:

Structure (also keeping in mind rendering on website)
Tone (e.g. use of examples)
Length / level of details

Iota87 commented 4 years ago

Great comments, Henri, and good catches on the typos. I added tokenize and references, and adjusted the structure in line with your input. Let me know what you think. I am a bit hesitant to add "remove_html_tags" here because I do not know whether it can be explained succinctly and in plain words to a complete beginner. It could be explained in a separate section or tutorial, but I am not sure you want to get into HTML tags in the getting-started guide. What do you think?

jbesomi commented 4 years ago

Hi Guys!

Thank you Giovanni for the great start and Henri for the comments!

Sorry for the late review!

As a general comment, I think we need to make it more technical and concise. The end goal of the getting started preprocessing tutorial is to teach how to use Texthero to actually do text preprocessing.

As we want to guide the user through Texthero preprocessing core, it's important to show them how to actually do the stuff.

Giovanni, do you think you can start from the comment below, test the code in a Jupyter Notebook, and then write a getting-started tutorial around it? I didn't go into the details, to give you more freedom; if you want more advice or something is unclear, just let me know!

Kind regards, Jonathan

(overview + what's important to keep in mind)

One of Texthero's pillars is text preprocessing.
We need to mention the modular approach (one function for one task), and that the user can customize the pipeline.
Preprocessing is task- and domain-specific. Developers need to know what they want; Texthero provides a tool to experiment quickly. It's advised to start with the standard clean pipeline, see if that works, and otherwise iterate until the problem is solved.
Texthero preprocessing is seen mainly as a preprocessing step for bag-of-words models, where what matters is the content (not the grammar or punctuation). In bag-of-words models, we want to get rid of punctuation and stopwords, and we want to normalize (stem). This is different from the more advanced and complex transformer-based neural network architectures ... there we might want to keep the punctuation as well as the stopwords ... but if the text data are very dirty, a general cleaning might be useful anyway (for example, removing round brackets and their content generally helps, and replacing numbers such as 12.3 with NUM might help as well) ...
Users come here after having read the "getting started" page, so they already know about the clean function. Here we want to offer something more and explain how to clean some text data. It's important to give users examples and to guide them through the process.
We want to teach users how to use the preprocessing API, and we want to mention at least 50% of its functions.
Tokenization part: hide for now, as we are making major changes there.
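To make the bag-of-words point above concrete, here is a minimal standard-library sketch of that kind of cleaning. It is not Texthero's implementation: the function name `clean_for_bag_of_words` and the tiny `STOPWORDS` set are hypothetical, stemming is left out, and the regular expressions are only one plausible way to do each step.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"i", "am", "the", "a", "of", "and", "is"}


def clean_for_bag_of_words(text: str) -> List[str]:
    """Return content tokens only, with numbers mapped to the placeholder NUM."""
    text = text.lower()
    text = re.sub(r"\([^()]*\)", " ", text)         # drop round brackets and their content
    text = re.sub(r"\d+(?:\.\d+)?", " NUM ", text)  # e.g. 12.3 becomes NUM
    text = re.sub(r"[^\w\s]", " ", text)            # drop punctuation
    return [tok for tok in text.split() if tok not in STOPWORDS]


tokens = clean_for_bag_of_words("I am the law! (wow!) The fine is 12.3 dollars.")
bag = Counter(tokens)  # the bag-of-words view: token counts, word order discarded
```

For a transformer-style model one would likely skip the punctuation and stopword steps and keep only the general cleaning (brackets, numbers).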

Preprocessing

Overview

Introduction to this new "chapter": mention what we have seen before, plus an introductory sentence about preprocessing, something like: "By now you should have a general overview of what Texthero is about. In the next sections we will dig a bit deeper into Texthero's core and see what we can get out of our beautiful text data."

Preprocessing API

Link + introduction

Doing it right

There is no magic formula that works in every situation; Texthero provides a modular approach to data preprocessing.
The user needs to understand what the task actually requires.
Texthero is mostly used to get a first feeling for the data using bag-of-words approaches; in this case, the goal is to keep relevant and clean content.
Mention the bag-of-words approach and explain how it differs from transformers. Here the task really is going from raw data (maybe coming from OCR or scraped from a website) to something cleaner.

Standard vs Custom pipeline (old key function)

Mention that there is the standard clean function, or that we can customize the pipeline. Mention chaining: all preprocessing functions receive a Pandas Series as input and return a Pandas Series. This allows chaining multiple functions in a pandas-pythonic fashion.
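A rough sketch of why the Series-in, Series-out convention makes a standard and a custom pipeline interchangeable. Everything here is a plain-pandas stand-in, not Texthero code: `lowercase`, `strip_punctuation`, `collapse_whitespace`, `run_pipeline` and `DEFAULT_PIPELINE` are hypothetical names chosen for illustration.

```python
import pandas as pd


def lowercase(s: pd.Series) -> pd.Series:
    return s.str.lower()


def strip_punctuation(s: pd.Series) -> pd.Series:
    return s.str.replace(r"[^\w\s]", " ", regex=True)


def collapse_whitespace(s: pd.Series) -> pd.Series:
    return s.str.replace(r"\s+", " ", regex=True).str.strip()


def run_pipeline(s: pd.Series, pipeline) -> pd.Series:
    """Apply each Series -> Series step in order."""
    for step in pipeline:
        s = s.pipe(step)
    return s


# A "standard" pipeline is just a default list of steps ...
DEFAULT_PIPELINE = [lowercase, strip_punctuation, collapse_whitespace]

s = pd.Series(["Flame on!", "HULK   SMASH!"])
standard = run_pipeline(s, DEFAULT_PIPELINE)
# ... and a custom pipeline is any other list of the same kind of function.
custom = run_pipeline(s, [collapse_whitespace])
```

Because every step has the same shape, steps can be added, removed or reordered without touching the others.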

FAQ

FAQ questions, mostly to improve SEO.

Text preprocessing, From zero to hero

Preprocessing is about data cleaning. Let's assume we got some dirty data that we want to clean; in particular, we want to keep only relevant and clean content.

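import pandas as pd
import texthero as hero

df = pd.DataFrame(
    ["I have the power! $$ (wow!)", "Flame on!", "HULK SMASH!",
     # ...
     "Holy ____ Batman!", "I am the vengeance, I am the night, I am BATMAN!",
     "I am GROOT.", "I’m going ghost!", "I am the law!", "SPOOOON!!!"],
    columns=["text"],  # column name used by the calls below
)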

Let's start by calling clean ... see what happens.

hero.preprocessing.clean(df['text'])

...

comment ...

Now, assume we want to keep the punctuation marks but remove the parentheses ... open the "preprocessing API" page and look for "remove_brackets".

Show a custom pipeline and explain it:

# p is assumed to be an alias for texthero.preprocessing; function1-3 are placeholders
df['clean'] = (
    df['text']
    .pipe(p.function1)
    .pipe(p.function2)
    .pipe(p.function3)
)
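As an illustration of the scenario above (keep the punctuation, drop the parentheses), here is a sketch of the same chained shape with concrete steps. The two functions are hypothetical pandas stand-ins written for this example, not Texthero's own functions; in the tutorial they would be replaced by the matching functions from the preprocessing API page.

```python
import pandas as pd


def drop_round_brackets(s: pd.Series) -> pd.Series:
    """Remove round brackets together with their content."""
    return s.str.replace(r"\([^()]*\)", "", regex=True)


def squeeze_spaces(s: pd.Series) -> pd.Series:
    """Collapse runs of whitespace and trim both ends."""
    return s.str.replace(r"\s+", " ", regex=True).str.strip()


df = pd.DataFrame({"text": ["I have the power! $$ (wow!)", "Flame on!"]})

df["clean"] = (
    df["text"]
    .pipe(drop_round_brackets)
    .pipe(squeeze_spaces)
)
```

The exclamation marks and the `$$` survive because no punctuation step is part of this pipeline; only the bracketed aside is gone.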

Going further

Two or three high-quality links to other pages about text preprocessing, plus a getting-started tutorial on regex with Python

Recap

Iota87 commented 4 years ago

Sounds good, Jonathan! I reviewed your comments and suggestions; they are perfectly aligned with what we discussed on the call. Working on it! Thanks, Giovanni

jbesomi / texthero

Draft for getting-started-preprocessing #183
