quanteda / tutorials.quanteda.io
https://tutorials.quanteda.io

add instructions for Chinese #64

Closed yuanzhouIR closed 3 years ago

yuanzhouIR commented 3 years ago

I wrote some tips for Chinese text analysis based on my own experience. I would be happy if they are useful for other researchers.

koheiw commented 3 years ago

Thank you for the PR. Can you use the preamble of the UDHR in quanteda.corpora? It's new, so please install:

devtools::install_github("quanteda/quanteda.corpora", ref = "update-udhr")
> corp <- corpus_reshape(data_corpus_udhr["cmn_hant"], to = "paragraphs")
> corp[2]
Corpus consisting of 1 document and 4 docvars.
cmn_hant :
"鑑於對人類家庭所有成員的固有尊嚴及其平等的和不移的權利的承認,乃是世界自由、正義與和平的基礎, 鑑於對人權的無視和侮蔑已..."

It is good to have an example with jiebaR, but let's make it a separate section as a more advanced approach; a rough sketch of that route is below. I might add an example with RcppMeCab on the Japanese page too.
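
For reference, a minimal sketch of the jiebaR route (assumed usage based on the two packages' documented APIs; the worker() settings and the example sentence are placeholders, not the final tutorial code):

library(jiebaR)
library(quanteda)

cutter <- worker()  # default jieba segmenter; custom dictionaries can be attached here
txt <- c(d1 = "鉴于对人权的无视和侮蔑已发展为野蛮暴行")
toks <- as.tokens(lapply(txt, segment, jiebar = cutter))  # list of word vectors -> quanteda tokens
print(toks)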

yuanzhouIR commented 3 years ago

> Thank you for the PR. Can you use the preamble of the UDHR in quanteda.corpora? […] It is good to have an example with jiebaR, but let's make it a separate section as a more advanced approach.

I tried data_corpus_udhr["chn"] (data_corpus_udhr["cmn_hant"] is the traditional Chinese version). However, there seems to be a problem: there are unnecessary spaces between the Chinese characters, so the tokenizer cannot segment words. It is easy to fix:

data_corpus_udhr[["chn"]] <- gsub(" ", "", data_corpus_udhr[["chn"]])

Would you please first revise the corpus?

koheiw commented 3 years ago

Earlier versions of the files had problems, but I think the latest XML files are fine. Can you check?

data_corpus_udhr["cmn_hans"]

https://github.com/quanteda/quanteda.corpora/blob/update-udhr/sources/udhr/udhr_cmn_hans.xml

data_corpus_udhr["chn"] in the master, but please do not use as it is an old broken corpus.
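
One quick ad-hoc check (my own, not part of the package) is to look for spaces left between Han characters in the raw texts:

any(grepl("\\p{Han} \\p{Han}", as.character(data_corpus_udhr["cmn_hans"]), perl = TRUE))  # TRUE means stray spaces remain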

koheiw commented 3 years ago

> print(tokens(data_corpus_udhr["cmn_hans"]), -1, -1)
Tokens consisting of 1 document and 4 docvars.
cmn_hans :
   [1] "序言"     "鉴于"     "对"       "人类"     "家庭"     "所有"     "成员"     "的"       "固有"     "尊严"    
  [11] "及其"     "平等"     "的"       "和"       "不移"     "的"       "权利"     "的"       "承认"     ","       
  [21] "乃是"     "世界"     "自由"     "、"       "正义"     "与"       "和平"     "的"       "基础"     ","       
  [31] "鉴于"     "对"       "人权"     "的"       "无视"     "和"       "侮蔑"     "已"       "发展"     "为"      
  [41] "野蛮"     "暴行"     ","        "这些"     "暴行"     "玷污"     "了"       "人类"     "的"       "良心"    
  [51] ","        "而"       "一个"     "人人"     "享有"     "言论"     "和"       "信仰"     "自由"     "并"      
  [61] "免"       "予"       "恐惧"     "和"       "匮"       "乏"       "的"       "世界"     "的"       "来临"    
  [71] ","        "已"       "被"       "宣布"     "为"       "普通"     "人民"     "的"       "最高"     "愿望"    
  [81] ","        "鉴于"     "为"       "使"       "人类"     "不致"     "迫不得已" "铤"       "而"       "走"      
  [91] "险"       "对"       "暴政"     "和"       "压迫"     "进行"     "反叛"     ","        "有"       "必要"    
 [101] "使"       "人权"     "受"       "法治"     "的"       "保护"     ","        "鉴于"     "有"       "必要"    
 [111] "促进"     "各国"     "间"       "友好"     "关系"     "的"       "发展"     ","        "鉴于"     "各"      
 [121] "联合"     "国"       "国家"     "的"       "人民"     "已"       "在"       "联合"     "国"       "宪章"    

koheiw commented 3 years ago

I updated the master branch of quanteda.corpora too.

yuanzhouIR commented 3 years ago

> I updated the master branch of quanteda.corpora too.

I made an analysis example using data_corpus_udhr["cmn_hans"]. The segmentation accuracy is good but not perfect; for example, 联合国 (United Nations) is separated into "联合" "国".

tokens("联合国") Tokens consisting of 1 document. text1 : [1] "联合" "国"

I think we can work on how to produce more accurate tokens for Chinese and Japanese in the future.
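
In the meantime, known multi-word terms can be compounded back by hand; a minimal sketch of that workaround (mine, not part of the tutorial):

toks <- tokens("联合国")                                      # splits into "联合" "国"
toks <- tokens_compound(toks, phrase("联合 国"), concatenator = "")
print(toks)                                                   # now a single token "联合国"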

koheiw commented 3 years ago

Thanks. Have you tried collocation analysis to compound such words? I am curious if it works in Chinese.

yuanzhouIR commented 3 years ago

> Thanks. Have you tried collocation analysis to compound such words? I am curious if it works in Chinese.

I tried, but it does not seem to work.

> print(tstat_col)
   collocation count count_nested length   lambda        z
1    人人 有权    12            0      2 4.229453 9.495899
2    有权 享受     9            0      2 4.785270 8.580531
3  任何人 不得     4            0      2 4.741903 6.762994
4      宣言 所     3            0      2 5.266138 6.470412
5    有权 享有     4            0      2 4.430713 5.977755
6    促进 各国     2            0      2 6.140192 5.891164
7    充分 实现     2            0      2 6.140192 5.891164
8        此 项     2            0      2 6.988721 5.769537
9      鉴于 各     2            0      2 5.349887 5.741791
10     本 宣言     6            0      2 9.552440 5.687257
11     联合 国     6            0      2 9.040998 5.656746
12   不得 任意     3            0      2 5.708969 5.617040
13     法律 所     2            0      2 4.585277 5.419745
14   平等 保护     2            0      2 4.585277 5.419745
15     应 促进     2            0      2 4.608581 5.240781
16   国家 努力     2            0      2 5.548705 5.146251
17     教育 应     2            0      2 4.155357 5.078872
18   社会 保障     2            0      2 4.712012 5.053012
19   享受 法律     2            0      2 4.195065 5.025142
20   享受 公正     2            0      2 5.295536 4.956768
21   任意 剥夺     2            0      2 8.087948 4.832943
22     刑事 罪     2            0      2 8.087948 4.832943
23       此 种     2            0      2 8.087948 4.832943
24       所 载     3            0      2 7.465831 4.779336
25     各 会员     2            0      2 7.750861 4.729011
26   违背 联合     2            0      2 7.498931 4.630349
27   目的 在于     2            0      2 9.698614 4.623310
28       誓 愿     2            0      2 9.698614 4.623310
29     会员 国     2            0      2 7.129975 4.461062
30   不得 加以     3            0      2 6.808200 4.425827
31     有权 被     2            0      2 3.764454 4.423951
32   权利 包括     2            0      2 3.690445 4.344925
33     项 权利     2            0      2 4.538993 4.323087
34   自由 选择     2            0      2 4.503882 4.291961
35   基本 自由     2            0      2 3.403392 4.199649
36     所 有权     2            0      2 2.999812 3.967452
37   有权 自由     2            0      2 1.616449 2.352161

The above words should not be compounded.

koheiw commented 3 years ago

I cannot say much about the language, but at least I can see "联合 国"! Did you remove stopwords before applying collocation analysis with padding = TRUE? (Padding leaves placeholders where tokens were removed, so words that were not actually adjacent in the text do not get counted as collocations.)

yuanzhouIR commented 3 years ago

> Did you remove stopwords before applying collocation analysis with padding = TRUE?

Yes. Below is the replication code.

corp <- corpus_reshape(data_corpus_udhr["cmn_hans"], to = "paragraphs")
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE, padding = TRUE) %>%
    tokens_remove(stopwords("zh_cn", source = "marimo"), padding = TRUE)

print(toks[2], max_ndoc = 1, max_ntok = -1)

tstat_col <- toks %>% textstat_collocations()
print(tstat_col)

For a few results it is reasonable to compound (e.g. 联合国, 会员国, 所有权), but most of the pairs are independent words.
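
A possible next step (my sketch, not from the thread) would be to compound only the pairs judged reasonable, either by selecting them from tstat_col or by listing them directly:

toks_comp <- tokens_compound(toks, phrase(c("联合 国", "会员 国", "所 有权")), concatenator = "")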

koheiw commented 3 years ago

That helped, thank you. Let's merge and discuss further on a separate branch.