jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License

How to provide multilingual support #84

Open jbesomi opened 4 years ago

jbesomi commented 4 years ago

Text preprocessing might be very language-dependent.

It would be great for Texthero to offer text preprocessing across many different languages.

There are probably two ways to support multiple languages:

  1. A language-specific module:

from texthero.lang import hero_lang

  2. A global language setting:

from texthero import hero
hero.config.set_lang("lang")

We might also have cases where the dataset is composed of many languages. What would be the best solution in that case?

The first question we probably have to answer is: do different languages require very different preprocessing pipelines, and therefore different functions?

ryangawei commented 4 years ago

For Asian languages (Chinese, Japanese, ...), word segmentation is an essential preprocessing step. We usually remove non-textual characters from the corpus, making the documents look like naturally written text, and then segment the text with NLP tools. The difference from the current texthero.preprocessing.clean is that we need to keep punctuation and digits to ensure correct word segmentation, and only then proceed with the following steps. Despite the difference, these steps can be done with language-specific functions and custom pipelines, e.g.

from texthero import preprocessing
from texthero import nlp
import texthero as hero
hero.config.set_lang("cn")

custom_pipeline = [preprocessing.fillna,
                   preprocessing.remove_non_natural_text,
                   nlp.word_segment,
                   preprocessing.remove_stopwords,
                   ...]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)

It seems hero.config.set_lang("lang") could work properly and avoid redundant code.
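
For instance, a hypothetical word_segment for Chinese could simply wrap an existing segmenter such as jieba (a rough sketch, not an existing Texthero function; the exact segmentation output may vary):

import pandas as pd
import jieba

# Sketch of a possible nlp.word_segment for Chinese: segment each
# document with jieba and join the tokens with spaces.
def word_segment(s: pd.Series) -> pd.Series:
    return s.apply(lambda text: " ".join(jieba.cut(text)))

s = pd.Series(["我喜欢自然语言处理"])
print(word_segment(s)[0])  # e.g. "我 喜欢 自然 语言 处理"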

guilhermelowa commented 4 years ago

I just used Texthero for the first time yesterday, in Portuguese. The preprocessing pipeline seems fine, except for stopwords. A solution like the one @AlfredWGA mentioned, i.e. building and running a custom pipeline, would be very much appreciated inside hero. In my case, I would only need to change the stopwords function.

jbesomi commented 4 years ago

I just used Texthero for the first time yesterday, in Portuguese. The preprocessing pipeline seems fine, except for stopwords. A solution like the one @AlfredWGA mentioned, i.e. building and running a custom pipeline, would be very much appreciated inside hero. In my case, I would only need to change the stopwords function.

Hey @Jkasnese, if you look at the getting-started guide, you will see that you can define your own custom pipeline:

import texthero as hero
from texthero import preprocessing

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)

Is that what you were looking for?

jbesomi commented 4 years ago

For Asian languages (Chinese, Japanese, ...), word segmentation is an essential preprocessing step. We usually remove non-textual characters from the corpus, making the documents look like naturally written text, and then segment the text with NLP tools. The difference from the current texthero.preprocessing.clean is that we need to keep punctuation and digits to ensure correct word segmentation, and only then proceed with the following steps. Despite the difference, these steps can be done with language-specific functions and custom pipelines, e.g.

from texthero import preprocessing
from texthero import nlp
import texthero as hero
hero.config.set_lang("cn")

custom_pipeline = [preprocessing.fillna,
                   preprocessing.remove_non_natural_text,
                   nlp.word_segment,
                   preprocessing.remove_stopwords,
                   ...]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)

It seems hero.config.set_lang("lang") could work properly and avoid redundant code.

Great!

guilhermelowa commented 4 years ago

I just used Texthero for the first time yesterday, in Portuguese. The preprocessing pipeline seems fine, except for stopwords. A solution like the one @AlfredWGA mentioned, i.e. building and running a custom pipeline, would be very much appreciated inside hero. In my case, I would only need to change the stopwords function.

Hey @Jkasnese, if you look at the getting-started guide, you will see that you can define your own custom pipeline:

import texthero as hero
from texthero import preprocessing

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_whitespace]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)

Is that what you were looking for?

Yes, I think so! Thanks! I'll check later how to pass arguments to the remove_stopwords function. It would be nice if it were possible to do something like:

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_stopwords(my_stopwords),
                   preprocessing.remove_whitespace]

But I'm guessing it's done with **kwargs, right? I don't know, I'm a newbie hahaha, gonna check this later. Thanks again!

jbesomi commented 4 years ago

You can solve it like this:

import texthero as hero
import pandas as pd

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

pipeline = [
    lambda s: hero.remove_stopwords(s, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)

All items in the pipeline must be callable (i.e. functions).

I agree that this is not trivial. We will make sure this is well explained, both in the docstring of the clean function and in the (soon to arrive) Getting started: preprocessing part. By the way, @Jkasnese, would you be interested in explaining that in the clean docstring (your comment would then appear there)?
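
As an aside, functools.partial from the standard library is an alternative to the lambda, and also keeps every pipeline item callable:

from functools import partial

import pandas as pd
import texthero as hero

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

# partial(...) returns a callable, so it satisfies the rule that all
# pipeline items must be functions, just like the lambda above.
pipeline = [
    partial(hero.remove_stopwords, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)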

guilhermelowa commented 4 years ago

I'd like that! I can't make any guarantees though, since I'm already involved in many projects ): I'll try to do it by Saturday.

I still have some questions, which might be good since I can address them in the docstring. Should I create a new issue (since it's a bit off-topic for this issue) or message you privately?

jbesomi commented 4 years ago

Both solutions work; either open an issue or send me an email: jonathanbesomiATgmail.com

cedricconol commented 4 years ago

You can solve it like this:

import texthero as hero
import pandas as pd

s = pd.Series(["is is a stopword"])
custom_set_of_stopwords = ['is']

pipeline = [
    lambda s: hero.remove_stopwords(s, stopwords=custom_set_of_stopwords)
]

s.pipe(hero.clean, pipeline=pipeline)

All items in the pipeline must be callable (i.e. functions).

I agree that this is not trivial. We will make sure this is well explained, both in the docstring of the clean function and in the (soon to arrive) Getting started: preprocessing part. By the way, @Jkasnese, would you be interested in explaining that in the clean docstring (your comment would then appear there)?

@jbesomi I think it would also be good to add a language parameter to remove_stopwords. I've used a solution similar to your suggestion above for Spanish and Indonesian. We could use the stop-words library to load stopwords for different languages. What do you think?
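
A rough sketch of what I mean, combining the stop-words package with the lambda approach from above (the language and data here are just examples):

import pandas as pd
import texthero as hero
from stop_words import get_stop_words  # pip install stop-words

# Load a language-specific stopword list and feed it to remove_stopwords.
spanish_stopwords = set(get_stop_words('spanish'))

s = pd.Series(["esta es una frase de ejemplo"])
pipeline = [
    lambda s: hero.remove_stopwords(s, stopwords=spanish_stopwords)
]
s.pipe(hero.clean, pipeline=pipeline)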

jbesomi commented 4 years ago

I fully agree with what you are proposing, i.e. to permit removing stopwords for a specific language. The only big question (and the main purpose of this discussion) is to understand how.

There are different alternatives:

  1. Add a language parameter (as you were proposing)
  2. Automatically detect the language (see #3) and remove the stopwords for that language.
  3. By default, use one very big stopword set that contains the stopwords of all languages. The main drawback is that a stopword in one language might not be a stopword in another ... (👎 )

In general, I'm against adding too many arguments to functions, as this generally makes them more complex to understand and use ...

Also, something we always need to keep in mind is that, from a multilingual perspective, Texthero is composed of two kinds of functions:

  1. Functions that need to distinguish the language of the Series (or cell)
  2. Functions that work independently of the underlying language

Only some of the preprocessing functions fall under (1) (tell me if you think this is wrong); such functions are, for instance, remove_stopwords and tokenize. It might be redundant to specify the language parameter for each of these functions, and that's why @AlfredWGA's solution of having a global setting hero.config.set_lang("zh") is probably a great idea.

Your opinions?

Technically, do you think it is feasible, and not too complex, to have a global setting for the language at the module level?
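
To make the question concrete, a module-level setting could be as simple as this (a hypothetical sketch; none of these names exist in Texthero yet):

# texthero/config.py -- hypothetical sketch of a module-level setting
_current_lang = "en"

def set_lang(lang: str) -> None:
    """Set the language used by all language-dependent functions."""
    global _current_lang
    _current_lang = lang

def get_lang() -> str:
    return _current_lang

# A language-dependent function would then consult the setting, e.g.:
# def remove_stopwords(s, stopwords=None):
#     stopwords = stopwords or default_stopwords(get_lang())
#     ...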

cedricconol commented 4 years ago

I think @AlfredWGA's solution of having a global setting is a better idea than adding a language parameter to every function as I suggested, and it is also more feasible.

Option 3 is also very interesting and might be an even better idea, as it automates the process. It aligns perfectly with Texthero's purpose of being easy to use and understand.

ryangawei commented 4 years ago

I found a problem with using a global language setting. Some functions cannot be applied to Asian languages, e.g. remove_diacritics or stem. Also, remove_punctuation is integrated into remove_stopwords after tokenization. When the user selects a certain language, I think we shouldn't expose APIs that they cannot use, as that might lead to confusion. Any idea how to solve this?

jbesomi commented 4 years ago

Hey @AlfredWGA !

Apologies, what do you mean by "integrated"? ("Also, remove_punctuation is integrated into remove_stopwords after tokenization")

I agree.

To better understand the problem, we should probably create a document (by opening a new issue or using a Google doc or similar) and, for different languages, make a list of the necessary functions. Presumably, except for some functions in preprocessing (i.e. remove_diacritics for Asian languages, ...), all other functions will be useful for all languages.

Once we have this, it will be easier to see how to solve the problem. Hopefully, we will notice some patterns and be able to group together languages that share a common preprocessing pipeline. Do you agree?

Then, a simple idea might be to have a select menu on the API page that lets the user show only the functions relevant to a given language. This, coupled with getting-started tutorials in all the different languages (or kinds of languages), should reduce the confusion.

What are your thoughts?

ryangawei commented 4 years ago

Hey @AlfredWGA !

Apologies, what do you mean by "integrated"? ("Also, remove_punctuation is integrated into remove_stopwords after tokenization")

I agree.

To better understand the problem, we should probably create a document (by opening a new issue or using a Google doc or similar) and, for different languages, make a list of the necessary functions. Presumably, except for some functions in preprocessing (i.e. remove_diacritics for Asian languages, ...), all other functions will be useful for all languages.

Once we have this, it will be easier to see how to solve the problem. Hopefully, we will notice some patterns and be able to group together languages that share a common preprocessing pipeline. Do you agree?

Then, a simple idea might be to have a select menu on the API page that lets the user show only the functions relevant to a given language. This, coupled with getting-started tutorials in all the different languages (or kinds of languages), should reduce the confusion.

What are your thoughts?

Sorry for the confusion. The default pipeline for preprocessing Chinese text should look like this:

from typing import Callable, List

import pandas as pd

def get_default_pipeline() -> List[Callable[[pd.Series], pd.Series]]:
    return [
        fillna,
        remove_whitespace,
        tokenize,
        remove_digits,
        remove_stopwords
    ]

Punctuation and stopwords should be removed after tokenization (as they might affect the word segmentation results). We can put punctuation into the list of stopwords and remove both together using remove_stopwords, so a series of remove_* functions might be unnecessary. Plus, all functions in preprocessing.py currently deal with Series of str; with tokenize as a prior step, a lot of functions would have to accept a Series of list as input.

In this case, if we use hero.config.set_lang("lang"), how can we make the inapplicable functions invisible when the user calls hero.*? On the other hand, from texthero.lang import hero_lang can import only the functions necessary for a certain language.
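
For example, a hypothetical texthero/lang/hero_cn/__init__.py could re-export only the functions that apply to Chinese, so the inapplicable ones are simply not there (all paths and names below are illustrative, not existing Texthero code):

# texthero/lang/hero_cn/__init__.py -- hypothetical layout
# Users doing `from texthero.lang import hero_cn as hero` only ever see
# these names; remove_diacritics, stem, etc. are simply absent.
from texthero.preprocessing import (
    fillna,
    remove_whitespace,
    remove_digits,
    remove_stopwords,
)
from .preprocessing import tokenize  # Chinese-specific word segmentation

__all__ = ["fillna", "remove_whitespace", "remove_digits",
           "remove_stopwords", "tokenize"]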

jbesomi commented 4 years ago

Hey @AlfredWGA, sorry for the late reply.

I agree that a series of remove_* functions might be unnecessary. On the other hand, someone might need to apply just remove_punctuation for some reason, and in that case such a function would be handy.

Regarding tokenize, what you say is super interesting. For your information, in the next version all representation functions will require a TokenSeries (a tokenized Series) as input.

In the case of preprocessing, if we consider Western languages, the current approach is that tokenization is done once, at the end of the preprocessing phase. remove_punctuation and remove_stopwords do not necessarily require an already tokenized string as input, as we might just apply str.replace.

The main reason we have not, until now, required preprocessing functions to receive a tokenized series as input is performance. For example, remove_punctuation uses str.replace + regex; I assumed that this was faster than iterating over every token and removing the ones that are punctuation.
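
A rough way to check that assumption (just a sketch, not a rigorous benchmark; the numbers will vary with the data):

import string
import timeit

import pandas as pd

s = pd.Series(["Hello , world ! This is a sentence ."] * 10_000)

# Vectorized path: one regex pass per string via str.replace.
regex_time = timeit.timeit(
    lambda: s.str.replace(r"[^\w\s]", "", regex=True), number=10)

# Token-iteration path: split into tokens, drop punctuation-only tokens.
punct = set(string.punctuation)
token_time = timeit.timeit(
    lambda: s.apply(lambda t: " ".join(w for w in t.split() if w not in punct)),
    number=10)

print(f"regex: {regex_time:.2f}s  token loop: {token_time:.2f}s")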

For Asian languages, must tokenization strictly be done before applying remove_*? If yes, I'm fine with reconsidering this and having the first task consist of tokenizing the Series. As we aim for simplicity and unification, it would not make sense to have two different approaches for different languages (when a universal solution exists).

One more thing regarding what you were proposing (remove_*):

As an alternative to multiple remove_* functions, we might have a universal function remove(s, tokens) that removes all the given tokens from the (tokenized) Series. Then we might provide, through a module or something similar, such collections of tokens:

from tokens import stopwords
from tokens import punctuation
from tokens import ...

s = pd.Series([...])
s = hero.tokenize(s)
hero.remove(s, stopwords.union(punctuation).union(...))
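
A minimal sketch of what such a remove could look like, assuming the Series is already tokenized, i.e. each cell is a list of tokens (the function is hypothetical, not part of Texthero):

from typing import Set

import pandas as pd

def remove(s: pd.Series, tokens: Set[str]) -> pd.Series:
    """Drop every token contained in `tokens` from each cell."""
    return s.apply(lambda cell: [t for t in cell if t not in tokens])

s = pd.Series([["this", "is", "a", "sentence", "!"]])
stopwords = {"is", "a"}
punctuation = {"!"}
print(remove(s, stopwords | punctuation))  # -> 0    [this, sentence]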

Looking forward to hearing from you! 👍

ryangawei commented 4 years ago

For Asian languages, must tokenization strictly be done before applying remove_*?

From my perspective, yes, except for some strings that won't interfere with word segmentation (URLs, HTML tags, \n and \t, etc.).

TokenSeries and a tokens module are a great idea. If we implement that, the clean step for Asian languages could look like this:

from texthero.lang import hero_cn as hero

custom_pipeline = [hero.preprocessing.fillna,
                   hero.preprocessing.remove_whitespace,
                   ...,
                   hero.preprocessing.tokenize,
                   lambda s: hero.remove(s, hero.tokens.stopwords.union(hero.tokens.punctuation).union(...))]

Then the cleaning process for Western and Asian languages will be unified. What do you think?

jbesomi commented 4 years ago

Sounds good!

We might call the tokens module collections or something similar:

from texthero.collections import stopwords
from texthero.collections import punctuation
...

Yes, TokenSeries seems a promising idea. @henrifroese is working on it; if you are interested, have a look at #60.

@AlfredWGA, how do you suggest we proceed, in relation to how you plan to contribute?

ryangawei commented 4 years ago

I'll start implementing texthero.lang.hero_cn to make all the functions support Chinese. If #60 is completed, I'll refactor the code to accommodate this feature. Is that OK?

jbesomi commented 4 years ago

OK!