jbesomi opened this issue 4 years ago
For Asian languages (Chinese, Japanese, ...), word segmentation is an essential preprocessing step. We usually remove non-textual characters from the corpus, making it look like naturally written text, and then segment the text with NLP tools. The difference from the current texthero.preprocessing.clean is that we need to keep punctuation and digits to ensure the correctness of word segmentation, and only then proceed with the following steps. Despite the difference, these steps can be done with language-specific functions and custom pipelines, e.g.
from texthero import preprocessing
from texthero import nlp
import texthero as hero

hero.config.set_lang("cn")

custom_pipeline = [preprocessing.fillna,
                   preprocessing.remove_non_natural_text,
                   nlp.word_segment,
                   preprocessing.remove_stopwords,
                   ...]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)
It seems hero.config.set_lang("lang") could work properly and avoid redundant code.
Just used Texthero for the first time yesterday in Portuguese. The preprocessing pipeline seems fine, except for stopwords. A solution like the one @AlfredWGA mentioned, i.e. building and running a custom pipeline, would be very much appreciated inside hero. In my case, I would only need to change the stopwords function.
Hey @Jkasnese, if you look at the getting-started guide, you will see that you can define your own custom pipeline:
import texthero as hero
from texthero import preprocessing

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)
Is that what you were looking for?
Great!
Yes, I think so! Thanks! I'll check later how to pass arguments to the remove_stopwords function. It would be nice if it was possible to do something like:
custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_stopwords(my_stopwords),
                   preprocessing.remove_whitespace]
But I'm guessing it's with **kwargs, right? I don't know, I'm a newbie haha, gonna check this later. Thanks again!
You can solve it like this:
import texthero as hero
import pandas as pd

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

pipeline = [
    lambda s: hero.remove_stopwords(s, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)
All items in the pipeline must be callable (i.e. functions).
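If the lambda feels verbose, functools.partial from the standard library gives an equivalent callable; a minimal sketch, using the same example data as above:

from functools import partial
import pandas as pd
import texthero as hero

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

# partial(...) returns a callable that receives the Series as its only positional
# argument, so it satisfies the "must be callable" requirement of the pipeline
pipeline = [partial(hero.remove_stopwords, stopwords=custom_set_of_stopwords)]
s.pipe(hero.clean, pipeline=pipeline)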
I agree that this is not trivial. We will make sure that this is well explained both in the docstring of the clean function and in the (soon to arrive) Getting started: preprocessing part. By the way, @Jkasnese, would you be interested in explaining that in the clean docstring (your comment would then appear there)?
I'd like that! Can't make guarantees though, since I'm already involved in many projects ): I'll try to do it by Saturday.
I still have some questions, which might be good since I can address them in the docstring. Should I create a new issue (since it's a bit off-topic for this issue) or message you privately?
Both solutions work; either open an issue or send me an email: jonathanbesomiATgmail.com
@jbesomi I think it would also be good to add a language parameter to remove_stopwords. I've used a solution similar to your suggestion above for the Spanish and Indonesian languages. We can use the stop-words library to load stop words for different languages, for example as sketched below. What do you think?
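For example, a minimal sketch of how the stop-words package could be combined with the existing stopwords argument of remove_stopwords (language codes depend on what the package supports):

from stop_words import get_stop_words
import texthero as hero
import pandas as pd

s = pd.Series(["esto es una prueba"])
# load Spanish stop words from the stop-words package and pass them to texthero
spanish_stopwords = set(get_stop_words("spanish"))
s_clean = hero.remove_stopwords(s, stopwords=spanish_stopwords)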
I fully agree with what you are proposing, i.e. permitting the removal of stopwords for a specific language. The only big question (and the main purpose of this discussion tab) is to understand how.
There are different alternatives:
- a language parameter (as you were proposing)
- a stopwords set that contains all stopwords of all languages. The main drawback is that a stopword in one language might not be a stopword in another one ... (👎)
In general, I'm against adding too many arguments to functions, as this generally makes them more complex to understand and use ...
Also, something we always need to keep in mind is that, from a multilingual perspective, Texthero is composed of two kinds of functions:
1. language-dependent functions
2. language-independent functions
Only some of the preprocessing functions fall under (1) (tell me if you think that this is wrong). Such functions are for instance remove_stopwords and tokenize. It might be redundant to specify the language parameter for each of these functions, and that's why @AlfredWGA's solution of having a global setting hero.config.set_lang("zh") is probably a great idea.
Your opinions?
Technically, do you think it is feasible and not too complex to have a global setting for the lang at the module level?
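As a rough illustration (hypothetical module and function names, not existing Texthero code), such a module-level setting could be as simple as:

# texthero/config.py (hypothetical sketch)
_lang = "en"  # module-level default language

def set_lang(lang: str) -> None:
    global _lang
    _lang = lang

def get_lang() -> str:
    return _lang

Language-dependent functions such as remove_stopwords or tokenize could then fall back to get_lang() whenever no explicit language argument is given.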
I think @AlfredWGA's solution of having a global setting is a better idea than adding a language parameter to every function as I suggested, and it is also more feasible.
I found a problem with using a global language setting. Some functions cannot be applied to Asian languages, e.g. remove_diacritics and stem. Also, remove_punctuation is integrated into remove_stopwords after tokenization. When the user selects a certain language to process, I think we shouldn't expose APIs that they cannot use, which might lead to confusion. Any idea how to solve this?
Hey @AlfredWGA!
Apologies, what do you mean by "integrated"? ("Also, remove_punctuation is integrated into remove_stopwords after tokenization")
I agree.
To understand the problem better, we should create a document (by opening a new issue or using a Google doc or similar) and, for different languages, make a list of the necessary functions. Presumably, except for some functions in preprocessing (i.e. remove_diacritics for Asian languages, ...), all other functions will be useful for all languages.
Once we have this, it will be easier to see how to solve the problem. Hopefully, we will notice some patterns and be able to group together languages that share a common preprocessing pipeline. Do you agree?
Then, a simple idea might be to have a select items menu on the API page that lets the user show only the functions relevant for the given language. This, coupled with getting-started tutorials in all the different languages (or kinds of languages), should reduce the confusion.
What are your thoughts?
Sorry for the confusion. The default pipeline for preprocessing Chinese text should look like this,
from typing import Callable, List
import pandas as pd

def get_default_pipeline() -> List[Callable[[pd.Series], pd.Series]]:
    return [
        fillna,
        remove_whitespace,
        tokenize,
        remove_digits,
        remove_stopwords,
    ]
Punctuation and stopwords should be removed after tokenization (as they might affect the word segmentation results). We can put punctuation marks into the list of stopwords and remove them together using remove_stopwords, so a series of remove_* functions might be unnecessary. Plus, all functions in preprocessing.py currently deal with Series of str. With tokenize as a prior step, a lot of functions would have to require Series of list as input.
In this case, if we use hero.config.set_lang("lang"), how can we make the unnecessary functions invisible when the user calls hero.*? On the other hand, from texthero.lang import hero_lang can import only the functions necessary for a certain language.
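As an illustration of that second option, a hypothetical texthero/lang/hero_cn.py could simply re-export the functions that make sense for Chinese, so the unsupported ones never show up (the module layout is an assumption; the imported functions do exist in texthero.preprocessing):

# texthero/lang/hero_cn.py (hypothetical sketch)
from texthero.preprocessing import (
    fillna,
    remove_whitespace,
    tokenize,
    remove_digits,
    remove_stopwords,
)

# only the names listed here are exposed; remove_diacritics, stem, etc. are left out
__all__ = ["fillna", "remove_whitespace", "tokenize", "remove_digits", "remove_stopwords"]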
Hey @AlfredWGA, sorry for the late reply.
I agree that a series of remove_* might be unnecessary. On the other hand, someone might just need to apply remove_punctuation for some reason, and in that case such a function might be handy.
Regarding tokenize, what you say is super interesting. For your information, in the next version all representation functions will require a TokenSeries (a tokenized series) as input.
In the case of preprocessing, if we consider Western languages, the current approach is that tokenization is done once, at the end of the preprocessing phase. remove_punctuation and remove_stopwords do not necessarily need to receive an already tokenized string as input, as we might just apply str.replace:
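Something along these lines (a sketch of the idea, not the exact Texthero implementation):

import string
import pandas as pd

s = pd.Series(["Hello, world! This is Texthero."])
# replace any run of punctuation with a space, working directly on the string Series
s_no_punct = s.str.replace(rf"([{string.punctuation}])+", " ", regex=True)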
The main reason we have not required preprocessing functions to receive a tokenized series as input until now is performance. For example, remove_punctuation uses str.replace + regex. I assumed that this was faster than iterating over every token and removing the ones that are punctuation.
For Asian languages, must tokenization strictly be done before applying remove_*? If yes, I'm fine with reconsidering this and having the first task consist of tokenizing the Series. As we aim for simplicity and unification, it would not make sense to have two different approaches for different languages (when a universal solution exists).
One more thing regarding what you were proposing (remove_*): as an alternative to multiple remove_* functions, we might have a universal function remove(s, tokens) that removes the given tokens from the (tokenized) Series. Then, through a module or something similar, we might provide such collections of tokens:
from tokens import stopwords
from tokens import punctuation
from tokens import ...

s = pd.Series([...])
s = hero.tokenize(s)
hero.remove(s, stopwords.union(punctuation).union(...))
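For what it's worth, a minimal sketch of what such a remove function could do on an already tokenized Series (hypothetical, not existing Texthero code):

import pandas as pd

def remove(s: pd.Series, tokens: set) -> pd.Series:
    # drop every token that appears in the given collection
    return s.apply(lambda token_list: [t for t in token_list if t not in tokens])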
Looking forward to hearing from you! 👍
For Asian languages, tokenization should be strictly done before applying remove_*?

From my perspective, yes, except for some strings that won't interfere with word segmentation (URLs, HTML tags, \n and \t, etc.).
TokenSeries and a tokens module are a great idea. If we implement that, clean for Asian languages could look like this:
from texthero.lang import hero_cn as hero

custom_pipeline = [hero.preprocessing.fillna,
                   hero.preprocessing.remove_whitespace,
                   ...,
                   hero.preprocessing.tokenize,
                   lambda s: hero.remove(s, hero.tokens.stopwords.union(hero.tokens.punctuation).union(...))]
Then the cleaning process for Western and Asian languages will be unified. What do you think?
Sounds good!
We might call the tokens module collections or something similar:
from texthero.collections import stopwords
from texthero.collections import punctuation
...
Yes, TokenSeries seems a promising idea. @henrifroese is working on it. If you are interested, have a look at #60.
@AlfredWGA, how do you suggest we proceed, in relation to how you plan to contribute?
I'll start implementing texthero.lang.hero_cn to make all functions support Chinese. If #60 is completed, I'll refactor the code to accommodate this feature. Is that OK?
OK!
Text preprocessing might be very language-dependent.
It would be great for Texthero to offer text preprocessing in all different languages.
There are probably two ways to support multiple languages: a language parameter (or a global language setting) on the existing functions, or separate language-specific sub-modules (e.g. texthero.lang.hero_cn).
We might also have cases where the dataset is composed of many languages. What would be the best solution in that case?
The first question we probably have to solve is: do different languages require very different preprocessing pipelines, and therefore different functions?