jbesomi / texthero

Text preprocessing, representation and visualization from zero to hero.
https://texthero.org
MIT License

Chinese language support #68

Open ryangawei opened 4 years ago

ryangawei commented 4 years ago

Hello, my name is Guoao Wei. I am a Chinese student interested in NLP and I can help with the Chinese language support for this amazing repository.

About me

I received a bachelor's degree in Software Engineering in China. I worked as a research intern at the Chinese Academy of Sciences for a year, focusing on NLP-related topics.

I have been searching for tools that save time on writing redundant preprocessing code when dealing with text data (I wrote my own simple one, AlfredWGA/nlputils), until I found Texthero. Therefore I am happy to contribute to this toolkit.

Things I can do

jbesomi commented 4 years ago

Hi Guoao!

Thank you for your message and for your help! We are pleased to have you here!!

For the "things, you can do", which one would you be more interested to start with?

For "Translate documents & tutorials into Chinese" this is very cool and very useful, first, though, we will need to set-up a system that allows this kind of translation. Basically, using Sphinx internationalization.

Unfortunately, I do not speak nor write Chinese and I have never done NLP on Chinese texts. What are the most popular Chinese NLP Python tools that Chinese NLP developers are using now? Do you think it will be possible to develop something like this?

import texthero.zh_cn as hero
hero.clean(df['text']) 
...

Also, do you think we will need to provide support for zh-CN (Chinese Simplified) or zh-TW (Chinese Traditional) or both?

Regards,

ryangawei commented 4 years ago

I would prefer to start with adding Chinese support for the preprocessing module.

The most common Chinese NLP tools right now are probably jieba, HanLP, and pkuseg. Also, spaCy has integrated jieba and pkuseg into its Chinese language support.

I think creating a distinct module texthero.zh_cn for Chinese is viable, but not necessary. When preprocessing a Chinese corpus we follow the same procedure as other languages, except we need a specific tool to do word segmentation. So maybe we can add a config parameter for the user to choose the language? For example,

import texthero as hero
hero.config.set_lang('zh_cn')
df['clean_text'] = hero.clean(df['text'])
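
As an illustration only, here is a minimal sketch of what such a config switch could look like; Texthero has no config module today, so set_lang and get_lang below are hypothetical names, not the actual API:

# Hypothetical sketch of the proposed config switch -- texthero has no
# `config` module today; set_lang/get_lang are illustrations only.
_LANG = "en"

def set_lang(lang: str) -> None:
    """Select the language used by language-dependent functions (e.g. 'zh_cn')."""
    global _LANG
    _LANG = lang

def get_lang() -> str:
    """Return the currently configured language."""
    return _LANG

# Language-dependent functions such as clean()/tokenize() could then check
# get_lang() and pick a Chinese word segmenter when it returns 'zh_cn'.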

Simplified and Traditional Chinese share almost the same grammar and usage, but differ in some specific words. For example, "software" is written as "软件" in Simplified and "軟體" in Traditional (simplified as "软体"). So we usually simply transform traditional characters into simplified ones, or vice versa. Since most Chinese NLP tools are developed on Simplified Chinese, I think support for zh-CN should be prioritized.

jbesomi commented 4 years ago

Hey! Thank you for your exhaustive answer!

Adding Chinese support sounds super good to me!

Before starting with the details of the implementation, we need to figure out how Texthero should provide multi-language support. I opened a discussion issue there: #84. What is your opinion on that? Imagine, for instance, that we have a Pandas Series composed of 4 different languages ... what would be the most elegant and easiest way to do text segmentation there?

jbesomi commented 4 years ago

Hey @AlfredWGA! Great, thank you!

Is that _wordsegmentation the same as tokenize()? Would you like to work on that? For simplicity, I would suggest that we start implementing this in a separate file (like preprocessing_zh.py) and only afterwards focus on the config part, to reduce merge conflicts and keep the code simple.

Looking forward to seeing what you come up with! Regards,

ryangawei commented 4 years ago

@jbesomi I'm also confused about the difference. Here it says that word segmentation happens prior to tokenization, but in practice we treat them as almost equivalent.

preprocessing_zh.py sounds good. Would you like to have a Chinese version of all functions in preprocessing.py, or just of some specific functions? Some functions also work well on Chinese as they are.

jbesomi commented 4 years ago

Hey @AlfredWGA!

For simplicity and to keep the same pipeline as the other languages, I would say we should treat word segmentation simply as tokenization.

Regarding your question, it probably makes sense to just focus on the specific functions we need for the Chinese language.

Some general remarks:

In principle, we prefer to install as few external packages as possible. Using jieba for text tokenization seems good to me as it does not have any other dependencies (but we might not need it either, see the comments below).

If you look at Texthero's source code (for example nlp.py) you will discover that most of Texthero's code is just a wrapper around Pandas and spaCy. spaCy has models for many languages, including Chinese. I'm not sure, but it might already support text tokenization natively ... ?

import pandas as pd
import spacy

try:
    # If not present, download 'zh_core_web_sm'
    spacy_model = spacy.load("zh_core_web_sm")
except OSError:
    from spacy.cli.download import download as spacy_download
    spacy_download("zh_core_web_sm")
    spacy_model = spacy.load("zh_core_web_sm")

def tokenize_with_spacy(s: pd.Series) -> pd.Series:

    # Current English pipeline; the idea is to swap the model name for Chinese.
    nlp = spacy.load("en_core_web_sm")  # disable=["ner", "tagger", "parser"]

    tokenized = []
    for doc in nlp.pipe(s):
        tokenized.append(list(map(str, doc)))

    return pd.Series(tokenized, index=s.index)
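
For reference, a quick usage sketch on a toy Series (the sentences below are made up for illustration):

# Illustrative usage of tokenize_with_spacy defined above.
import pandas as pd

s = pd.Series(["Texthero makes text preprocessing easy.",
               "It wraps Pandas and spaCy."])
tokenize_with_spacy(s)
# Each cell becomes a list of tokens, roughly:
# 0    [Texthero, makes, text, preprocessing, easy, .]
# 1    [It, wraps, Pandas, and, spaCy, .]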

Just by replacing "en_core_web_sm" with "zh_core_web_sm" we might be able to tokenize (word segment) Chinese text. Is that true?

Out of curiosity, I also looked at the Chinese stop words:

from spacy.lang.zh import stop_words as spacy_zh_stopwords
spacy_zh_stopwords.STOP_WORDS

The output looks like this:

{'般的',
 '屡次三番',
 '自从',
 '[①①]',
 ')、',
 '毫无例外',
 '替',
 '举凡',
 '附近',
 '|', ...
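
As an illustration, here is a minimal sketch of how that stop-word set could be used to filter an already tokenized Series (the remove_stopwords_zh name is hypothetical, not an existing hero function):

# Minimal sketch: filter spaCy's Chinese stop words from a tokenized Series.
# `remove_stopwords_zh` is a hypothetical name, not part of Texthero.
import pandas as pd
from spacy.lang.zh import stop_words as spacy_zh_stopwords

def remove_stopwords_zh(s: pd.Series) -> pd.Series:
    # Drop tokens that appear in spaCy's Chinese stop-word set.
    stopwords = spacy_zh_stopwords.STOP_WORDS
    return s.apply(lambda tokens: [t for t in tokens if t not in stopwords])

# Example on a pre-tokenized cell:
s = pd.Series([["西门子", "将", "努力", "参与", "中国", "的", "三峡", "工程", "建设", "。"]])
remove_stopwords_zh(s)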

Do you think we can basically just load "zh_core_web_sm" instead of "en_core_web_sm" and we will be able to provide Chinese support? It would be great if you could try that in a Jupyter Notebook and see how it works. I cannot do it myself as I don't know how an NLP pipeline for Chinese text works.

Other than word segmentation, which other pipeline parts are required when preprocessing Chinese text? In #84 you mentioned remove_non_natural_text. What would remove_non_natural_text do? I assume that it somehow replaces remove_punctuation and/or similar hero functions.

Thanks! 🎉

ryangawei commented 4 years ago

Hi @jbesomi. You are correct. I read about this issue; it seems zh_core_web_sm does include a word segmenter, which is trained on the OntoNotes dataset with gold segmentation.

Also, as I mentioned above, spaCy provides another API for users to directly call jieba and pkuseg (https://spacy.io/usage/models#chinese) for word segmentation.

import spacy
from spacy.lang.zh import Chinese

nlp1 = spacy.load("zh_core_web_sm")
nlp2 = Chinese() # jieba by default
text = "西门子将努力参与中国的三峡工程建设。"
doc1 = nlp1(text)
print('/'.join([t.text for t in doc1]))
doc2 = nlp2(text)
print('/'.join([t.text for t in doc2]))

The output is

西门子/将/努力/参与/中国/的/三峡/工程/建设/。
西门子/将/努力/参与/中国/的/三峡工程/建设/。

Although using zh_core_web_sm is the simplest way, there hasn't been much discussion about its word segmentation performance. On the other hand, jieba and pkuseg have been used in many Chinese NLP works: jieba is faster, while pkuseg is slower but more accurate. So I think we can do this,

def tokenize_with_spacy(s: pd.Series, tokenizer='spacy') -> pd.Series:

and let the user decide which segmenter they want to use.
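
A rough sketch of what that could look like, combining the pretrained model and spaCy's jieba-backed Chinese wrapper behind one parameter (tokenize_zh and its body are only an illustration, not Texthero code):

# Rough sketch of a tokenizer switch; illustration only.
import pandas as pd
import spacy
from spacy.lang.zh import Chinese

def tokenize_zh(s: pd.Series, tokenizer: str = "spacy") -> pd.Series:
    # Pick a segmenter: the pretrained model or the jieba-backed wrapper.
    if tokenizer == "spacy":
        nlp = spacy.load("zh_core_web_sm")
    elif tokenizer == "jieba":
        nlp = Chinese()  # jieba by default
    else:
        raise ValueError(f"Unknown tokenizer: {tokenizer}")

    tokenized = [[t.text for t in doc] for doc in nlp.pipe(s)]
    return pd.Series(tokenized, index=s.index)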

I just looked through Texthero's API and I think Chinese can use the same pipeline parts as the current ones.

jbesomi commented 4 years ago

Thank you for your exhaustive reply @AlfredWGA!

For your information, right now hero.tokenize() is regex-based. This has some flaws and I was seriously considering replacing it with the code I posted before (the one of hero.tokenize_with_spacy()).

I'm okay with letting the user select which segmenter to use.

def tokenize(s: pd.Series, tokenizer="spacy") -> pd.Series ...

where the valid tokenizer arguments are:

  1. "spacy"
  2. "jieba"
  3. "pkuseg" (in future)

This goes a bit in the opposite direction of the general purpose of Texthero. The idea of Texthero is that we evaluate the different options ("spacy", "jieba", "pkuseg", etc.) and pick the best one for the user, so that they do not have to make a choice. This rests on the assumptions that users do not know which one is better (exactly like us), that we have tested and picked the best, and that in most cases any solution is good enough (for very specific problems, Texthero does not work either ...).

Having said that, it would be even better if we did a comparison of the three tokenizers (or found some papers or articles that compare them) and picked the best one.

Also, if we find that spaCy is pretty good in the end, we might just find a solution to automatically detect the language and swap the model. This would already allow us to deal with almost any language, not just Chinese or English (isn't that awesome?). For reference, we are also working on a function that detects the language of each cell in a Pandas Series (#3), which might be pretty useful!
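
A small sketch of how such a model swap could work, using the langdetect package as a stand-in for the per-cell detection function from #3 (the language-to-model mapping and the tokenize_auto name are assumptions, not Texthero's API):

# Sketch only: langdetect stands in for the detection function from #3,
# and the language-to-model mapping is an assumption, not Texthero's.
import pandas as pd
import spacy
from langdetect import detect

MODEL_BY_LANG = {
    "en": "en_core_web_sm",
    "zh-cn": "zh_core_web_sm",
}

def tokenize_auto(s: pd.Series) -> pd.Series:
    # Guess the dominant language from a small sample of the Series.
    lang = detect(" ".join(s.head(10).astype(str)))
    nlp = spacy.load(MODEL_BY_LANG.get(lang, "en_core_web_sm"))
    return pd.Series([[t.text for t in doc] for doc in nlp.pipe(s)], index=s.index)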

ryangawei commented 4 years ago

Hi, @jbesomi. You've made a good point.

The Chinese model of spaCy was originally released from howl-anderson/Chinese_models_for_SpaCy; however, there hasn't been any info about its performance compared to other tools.

jieba is stable and has lots of users (over 23.6k stars). I would suggest that we use the jieba API from spaCy for starters, and see how spaCy's Chinese model performs later?

jbesomi commented 4 years ago

Hey @AlfredWGA, I gave a quick look at jieba; as you say, it seems easy to use and already widely embraced by the community. For me it's a clear yes!

Your idea of setting the language through config also sounds good to me!

I'm looking forward to seeing your implementation. For any questions, do not hesitate to ask! 😃

ryangawei commented 4 years ago

I just found at https://spacy.io/usage/models#chinese that spaCy's Chinese model uses a custom pkuseg model trained on OntoNotes 5.0. Sounds good, but I'll still go with jieba first and see if we need to make the multilingual code consistent with spaCy **_core_web_** models.
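
For reference, a minimal sketch of what a jieba-first tokenize in a separate preprocessing_zh.py could look like (file and function names follow the suggestions above; the body is only an illustration, not the final implementation):

# preprocessing_zh.py -- minimal sketch, not the final implementation.
import pandas as pd
import jieba

def tokenize(s: pd.Series) -> pd.Series:
    # Segment each Chinese text cell into a list of tokens with jieba.
    return s.apply(jieba.lcut)

# Example with the sentence from the thread:
# tokenize(pd.Series(["西门子将努力参与中国的三峡工程建设。"]))
# -> [['西门子', '将', '努力', '参与', '中国', '的', '三峡工程', '建设', '。']]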