Open jbesomi opened 4 years ago
I am structuring this part as follows (based on review of similar contexts, including Texthero "Getting Started" structure):
- Overview/Intro: Why is pre-processing crucial and what are the benefits of having a standardized/customizable pipeline
- Clean: What it does and how
- Custom Pipeline: Why and how you should take control of the pre-processing steps
- More details: Including pre-processing API functionalities
Please let me know if something is not clear or if you have any additional suggestions.
Task: write the "Getting started: preprocessing" doc page
Advice/Tips to the technical writer
Good to know:
- Concepts useful to have clear in mind:
  - how the `pandas.Series.pipe` function works
  - apart from `tokenize`, all preprocessing functions receive a `TextSeries` as input and return a `TextSeries`
Things to keep in mind when writing:
To stay in the technical discussion loop:
- Should we `tokenize` first, even before `remove_punctuation` or anything else? This is useful when dealing with non-Western languages (see #145).
Page
aim: learn how to preprocess a text-based dataset with Texthero
content:
- `clean` function: the default way, the option when no customization is required
- `clean` code and editing the pipeline (see #38)
- `tokenize` and `TokenSeries` (fundamentally different, show an example, every cell is now a list of tokens)
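As a starting point for the page's examples, here is a minimal sketch of the two ideas listed above: chaining steps with `pandas.Series.pipe`, and how tokenizing changes each cell from a string into a list of tokens. It uses plain pandas only. The functions `lowercase`, `remove_punctuation`, `remove_whitespace`, `tokenize` and `run_pipeline` defined here are hypothetical stand-ins written for illustration, not Texthero's implementations, and the real functions may behave differently.

```python
import re
import string

import pandas as pd


# Hypothetical stand-ins for preprocessing steps (NOT Texthero's own code).
# Each one takes a Series of strings and returns a Series of strings.
def lowercase(s: pd.Series) -> pd.Series:
    return s.str.lower()


def remove_punctuation(s: pd.Series) -> pd.Series:
    # Replace every ASCII punctuation character with a space.
    pattern = f"[{re.escape(string.punctuation)}]"
    return s.str.replace(pattern, " ", regex=True)


def remove_whitespace(s: pd.Series) -> pd.Series:
    # Collapse runs of whitespace and trim both ends.
    return s.str.replace(r"\s+", " ", regex=True).str.strip()


def tokenize(s: pd.Series) -> pd.Series:
    # Unlike the steps above, every cell becomes a list of tokens.
    return s.str.split()


def run_pipeline(s: pd.Series, pipeline) -> pd.Series:
    # Apply each step in order; Series.pipe(f) is equivalent to f(s).
    for step in pipeline:
        s = s.pipe(step)
    return s


texts = pd.Series(["Hello, World!", "Texthero is  GREAT."])

custom_pipeline = [lowercase, remove_punctuation, remove_whitespace]
cleaned = run_pipeline(texts, custom_pipeline)
tokens = cleaned.pipe(tokenize)

print(cleaned.tolist())  # ['hello world', 'texthero is great']
print(tokens.tolist())   # [['hello', 'world'], ['texthero', 'is', 'great']]
```

Because every string-to-string step has the same shape, steps can be reordered or dropped by editing the list, which is the point the custom pipeline section would make. The tokenizing step is kept outside the list here because its output has a different cell type.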