quanteda / tutorials.quanteda.io
https://tutorials.quanteda.io

add instructions for Chinese #64

Closed yuanzhouIR closed 3 years ago

yuanzhouIR commented 3 years ago

I wrote some tips for Chinese text analysis based on my own experience. I would be happy if they are useful for other researchers.

koheiw commented 3 years ago

Thank you for the PR. Can you use the preamble of the UDHR in quanteda.corpora? It's new, so please install:

devtools::install_github("quanteda/quanteda.corpora", ref = "update-udhr")
> corp <- corpus_reshape(data_corpus_udhr["cmn_hant"], to = "paragraphs")
> corp[2]
Corpus consisting of 1 document and 4 docvars.
cmn_hant :
"鑑於對人類家庭所有成員的固有尊嚴及其平等的和不移的權利的承認,乃是世界自由、正義與和平的基礎, 鑑於對人權的無視和侮蔑已..."

It is good to have an example with jiebaR, but let's make it a separate section as a more advanced approach; a rough sketch of that route is below. I might add an example with RcppMeCab on the Japanese page too.
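
For reference, a minimal sketch of the jiebaR route (assumed usage based on the two packages' documented APIs; the worker() settings and the example sentence are placeholders, not the final tutorial code):

library(jiebaR)
library(quanteda)

cutter <- worker()  # default jieba segmenter; custom dictionaries can be attached here
txt <- c(d1 = "鉴于对人权的无视和侮蔑已发展为野蛮暴行")
toks <- as.tokens(lapply(txt, segment, jiebar = cutter))  # list of word vectors -> quanteda tokens
print(toks)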

yuanzhouIR commented 3 years ago

> Thank you for the PR. Can you use the preamble of the UDHR in quanteda.corpora? […] It is good to have an example with jiebaR, but let's make it a separate section as a more advanced approach.

I tried data_corpus_udhr["chn"] (data_corpus_udhr["cmn_hant"] is the traditional Chinese version). However, there seems to be a problem: there are unnecessary spaces between the Chinese characters, so the tokenizer cannot segment words. It is easy to fix:

data_corpus_udhr[["chn"]] <- gsub(" ", "", data_corpus_udhr[["chn"]])

Would you please first revise the corpus?

koheiw commented 3 years ago

Earlier versions of the files had problems, but I think the latest XML files are fine. Can you check?

data_corpus_udhr["cmn_hans"]

https://github.com/quanteda/quanteda.corpora/blob/update-udhr/sources/udhr/udhr_cmn_hans.xml

data_corpus_udhr["chn"] in the master, but please do not use as it is an old broken corpus.
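
One quick ad-hoc check (my own, not part of the package) is to look for spaces left between Han characters in the raw texts:

any(grepl("\\p{Han} \\p{Han}", as.character(data_corpus_udhr["cmn_hans"]), perl = TRUE))  # TRUE means stray spaces remain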

koheiw commented 3 years ago

> print(tokens(data_corpus_udhr["cmn_hans"]), -1, -1)
Tokens consisting of 1 document and 4 docvars.
cmn_hans :
   [1] "序言"     "鉴于"     "对"       "人类"     "家庭"     "所有"     "成员"     "的"       "固有"     "尊严"    
  [11] "及其"     "平等"     "的"       "和"       "不移"     "的"       "权利"     "的"       "承认"     ","       
  [21] "乃是"     "世界"     "自由"     "、"       "正义"     "与"       "和平"     "的"       "基础"     ","       
  [31] "鉴于"     "对"       "人权"     "的"       "无视"     "和"       "侮蔑"     "已"       "发展"     "为"      
  [41] "野蛮"     "暴行"     ","        "这些"     "暴行"     "玷污"     "了"       "人类"     "的"       "良心"    
  [51] ","        "而"       "一个"     "人人"     "享有"     "言论"     "和"       "信仰"     "自由"     "并"      
  [61] "免"       "予"       "恐惧"     "和"       "匮"       "乏"       "的"       "世界"     "的"       "来临"    
  [71] ","        "已"       "被"       "宣布"     "为"       "普通"     "人民"     "的"       "最高"     "愿望"    
  [81] ","        "鉴于"     "为"       "使"       "人类"     "不致"     "迫不得已" "铤"       "而"       "走"      
  [91] "险"       "对"       "暴政"     "和"       "压迫"     "进行"     "反叛"     ","        "有"       "必要"    
 [101] "使"       "人权"     "受"       "法治"     "的"       "保护"     ","        "鉴于"     "有"       "必要"    
 [111] "促进"     "各国"     "间"       "友好"     "关系"     "的"       "发展"     ","        "鉴于"     "各"      
 [121] "联合"     "国"       "国家"     "的"       "人民"     "已"       "在"       "联合"     "国"       "宪章"    

koheiw commented 3 years ago

I updated the master branch of quanteda.corpora too.

yuanzhouIR commented 3 years ago

> I updated the master branch of quanteda.corpora too.

I made an analysis example using data_corpus_udhr["cmn_hans"]. The segmentation accuracy is good but not perfect; for example, 联合国 (United Nations) is separated into "联合" "国".

tokens("联合国") Tokens consisting of 1 document. text1 : [1] "联合" "国"

I think we can work on how to produce more accurate tokens for Chinese and Japanese in the future.
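
In the meantime, known multi-word terms can be compounded back by hand; a minimal sketch of that workaround (mine, not part of the tutorial):

toks <- tokens("联合国")                                      # splits into "联合" "国"
toks <- tokens_compound(toks, phrase("联合 国"), concatenator = "")
print(toks)                                                   # now a single token "联合国"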

koheiw commented 3 years ago

Thanks. Have you tried collocation analysis to compound such words? I am curious if it works in Chinese.

yuanzhouIR commented 3 years ago

> Thanks. Have you tried collocation analysis to compound such words? I am curious if it works in Chinese.

I tried, but it does not seem to work.

> print(tstat_col)
   collocation count count_nested length   lambda        z
1    人人 有权    12            0      2 4.229453 9.495899
2    有权 享受     9            0      2 4.785270 8.580531
3  任何人 不得     4            0      2 4.741903 6.762994
4      宣言 所     3            0      2 5.266138 6.470412
5    有权 享有     4            0      2 4.430713 5.977755
6    促进 各国     2            0      2 6.140192 5.891164
7    充分 实现     2            0      2 6.140192 5.891164
8        此 项     2            0      2 6.988721 5.769537
9      鉴于 各     2            0      2 5.349887 5.741791
10     本 宣言     6            0      2 9.552440 5.687257
11     联合 国     6            0      2 9.040998 5.656746
12   不得 任意     3            0      2 5.708969 5.617040
13     法律 所     2            0      2 4.585277 5.419745
14   平等 保护     2            0      2 4.585277 5.419745
15     应 促进     2            0      2 4.608581 5.240781
16   国家 努力     2            0      2 5.548705 5.146251
17     教育 应     2            0      2 4.155357 5.078872
18   社会 保障     2            0      2 4.712012 5.053012
19   享受 法律     2            0      2 4.195065 5.025142
20   享受 公正     2            0      2 5.295536 4.956768
21   任意 剥夺     2            0      2 8.087948 4.832943
22     刑事 罪     2            0      2 8.087948 4.832943
23       此 种     2            0      2 8.087948 4.832943
24       所 载     3            0      2 7.465831 4.779336
25     各 会员     2            0      2 7.750861 4.729011
26   违背 联合     2            0      2 7.498931 4.630349
27   目的 在于     2            0      2 9.698614 4.623310
28       誓 愿     2            0      2 9.698614 4.623310
29     会员 国     2            0      2 7.129975 4.461062
30   不得 加以     3            0      2 6.808200 4.425827
31     有权 被     2            0      2 3.764454 4.423951
32   权利 包括     2            0      2 3.690445 4.344925
33     项 权利     2            0      2 4.538993 4.323087
34   自由 选择     2            0      2 4.503882 4.291961
35   基本 自由     2            0      2 3.403392 4.199649
36     所 有权     2            0      2 2.999812 3.967452
37   有权 自由     2            0      2 1.616449 2.352161

The above words should not be compounded.

koheiw commented 3 years ago

I cannot say much about the language, but at least I can see "联合 国"! Did you remove stopwords before applying collocation analysis with padding = TRUE? (Padding leaves placeholders where tokens were removed, so words that were not actually adjacent in the text do not get counted as collocations.)

yuanzhouIR commented 3 years ago

> Did you remove stopwords before applying collocation analysis with padding = TRUE?

Yes. Below is the replication code.

corp <- corpus_reshape(data_corpus_udhr["cmn_hans"], to = "paragraphs")
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE, padding = TRUE) %>%
    tokens_remove(stopwords("zh_cn", source = "marimo"), padding = TRUE)

print(toks[2], max_ndoc = 1, max_ntok = -1)

tstat_col <- toks %>% textstat_collocations()
print(tstat_col)

For a few results it is reasonable to compound (e.g. 联合国, 会员国, 所有权), but most of the pairs are independent words.
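
A possible next step (my sketch, not from the thread) would be to compound only the pairs judged reasonable, either by selecting them from tstat_col or by listing them directly:

toks_comp <- tokens_compound(toks, phrase(c("联合 国", "会员 国", "所 有权")), concatenator = "")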

koheiw commented 3 years ago

That helped, thank you. Let's merge and discuss further on a separate branch.