statsmaths / cleanNLP

R package providing annotators and a normalized data model for natural language processing
GNU Lesser General Public License v2.1

problems with corenlp Chinese #52

Closed tianfrank closed 4 years ago

tianfrank commented 4 years ago

I'm trying to parse a Chinese sentence using the Stanford parser, but the results are unrecognizable Chinese characters. When I store the Chinese sentence in a txt file and parse the file, the results are still unrecognizable characters. I'm using RStudio and the default text encoding is UTF-8. Below are the code and results:

library(cleanNLP)
library(rJava)
cnlp_init_corenlp("zh", anno_level = 2)
chinese_text <- cnlp_annotate("text.txt", as_strings = FALSE)
head(cnlp_get_dependency(chinese_text, get_token = TRUE))
# A tibble: 5 x 10
  id   sid tid tid_target relation relation_full word    lemma
1 doc1   1   0          1 root     root          ROOT    ROOT
2 doc1   1   1          2 dobj     dobj          锘?,锘~  ""
3 doc1   1   1          3 ccomp    ccomp         锘?,锘~  ""
4 doc1   1   3          4 dobj     dobj          鐖?,鐖~  ""
5 doc1   1   1          5 punct    punct         锘?,锘~  ""
# ... with 2 more variables: word_target, lemma_target

Many thanks!

statsmaths commented 4 years ago

That is odd. Is it possible for you to either make part of the file available or create an example from a string? It's hard to debug without a working example.
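In the meantime, a couple of quick sanity checks on the encoding side might help narrow things down. This is only a rough diagnostic sketch in base R (nothing cleanNLP-specific), assuming the file is the text.txt from your example:

Sys.getlocale("LC_CTYPE")                            # Chinese text generally needs a UTF-8-capable locale
lines <- readLines("text.txt", encoding = "UTF-8")   # read the file explicitly as UTF-8
Encoding(lines)                                      # should report "UTF-8" (or "unknown" in a UTF-8 locale)
all(validUTF8(lines))                                # TRUE if every line is well-formed UTF-8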

tianfrank commented 4 years ago

@statsmaths

Thanks for the reply. Here is the code; the Chinese sentence is "我爱中国", which means "I love China". Even when the sentence is encoded as UTF-8, the problem persists.

x<-"我爱中国。" library(rJava) library(cleanNLP) cnlp_init_corenlp("zh", anno_level = 2) obj <- cnlp_annotate(x,as_strings = TRUE) head(cnlp_get_token(obj))

A tibble: 1 x 8 id sid tid word lemma upos pos cid

1 doc1 1 1 锟揭帮拷锟叫癸拷锟斤拷 锟揭憋拷锟叫癸拷锟斤拷 NA NR 0
statsmaths commented 4 years ago

Thanks for the example. This is a good demonstration of a long-standing problem I had with the way the CoreNLP Java library was written, and why I very recently re-wrote the package entirely (see version 3.0.0, now on CRAN). Apologies for the delay; it took a while to find the time to address all of the underlying issues.

Using the new version, I get the following after parsing the text:

x<-"我爱中国。"
library(cleanNLP)
cnlp_init_corenlp("zh")
obj <- cnlp_annotate(x)
obj$token
  doc_id sid tid token lemma  upos xpos    feats tid_source relation
1      1   1   1    我    我  PRON  PRP Person=1          2    nsubj
2      1   1   2    爱    爱  VERB   VV        _          0     root
3      1   1   3  中国  中国  VERB   VV        _          2    xcomp
4      1   1   4    。    。 PUNCT    .        _          2    punct

I do not know Chinese at all, but looking up the three tokens (我 and 爱 and 中国) in the Collins Chinese<=>English dictionary seemed to indicate that these related to the words for "me", "love" and "China".

Please let me know if that fixes the problem and if there are any other issues that come up with the new interface!

tianfrank commented 4 years ago

@statsmaths Thanks for updating the package. I have tried the original example and I get exactly the same results on my computer.

But here comes another question about parsing Chinese. In Chinese there are no spaces between words, so when doing dependency parsing the parser has to tokenize the text first. However, the tokenization is not always what we want (although there is some disagreement about how Chinese words should be segmented). It would be great if I could tokenize the sentence with a more accurate segmentation program (for example, Jieba https://github.com/fxsjy/jieba), check the segmentation myself, and then run dependency parsing on that segmentation. In other words, is it possible to do Chinese dependency parsing on already tokenized text, with spaces between the tokens?
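Something along these lines is what I have in mind. It is only a rough sketch using the jiebaR port of Jieba for the segmentation step, and the last call (handing the pre-tokenized, space-separated string back to cnlp_annotate) is exactly the part I am asking about:

library(jiebaR)
library(cleanNLP)

cnlp_init_corenlp("zh")

# segment with Jieba, check/fix the segmentation by hand, then rejoin with spaces
seg <- worker()
tokens <- segment("北海已成为中国对外开放中升起的一颗明星。", seg)
x_pretok <- paste(tokens, collapse = " ")

# the open question: have the dependency parser respect this segmentation
obj <- cnlp_annotate(x_pretok)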

Here is an example showing the tokenization problem with the Chinese module of the Stanford parser:

x<-"北海已成为中国对外开放中升起的一颗明星。" obj <- cnlp_annotate(x) Processed document 1 of 1 obj$token A tibble: 15 x 10 doc_id sid tid token lemma upos xpos feats tid_source relation

1 1 1 1 北海 北海 PROPN NNP _ 3 nsubj 2 1 1 2 已 已 ADV RB _ 3 advmod 3 1 1 3 成 成 VERB VV _ 0 root 4 1 1 4 为 为 VERB VC _ 3 mark 5 1 1 5 中 中 PROPN NNP _ 6 case:suff 6 1 1 6 国 国 PART SFN _ 8 nmod 7 1 1 7 对外 对外 NOUN NN _ 8 nmod 8 1 1 8 开放 开放 NOUN NN _ 10 advmod 9 1 1 9 中 中 ADP IN _ 8 acl 10 1 1 10 升起 升起 VERB VV _ 14 acl:relcl 11 1 1 11 的 的 PART DEC _ 10 mark:relcl 12 1 1 12 一 一 NUM CD NumType=Card 13 nummod 13 1 1 13 颗 颗 NOUN NNB _ 14 clf 14 1 1 14 明星 明星 NOUN NN _ 3 obj 15 1 1 15 。 。 PUNCT . _ 3 punct

An obvious problem with the result is that it separates the word 中国 (China) into two tokens (中 and 国).

Even if I pre-tokenize the sentence myself, with spaces between the words, the problem remains, and the tokenization of the sentence changes somewhat:

x<-"北海 已 成为 中国 对外开放 中 升起 的 一 颗 明星" obj <- cnlp_annotate(x) Processed document 1 of 1 obj$token A tibble: 16 x 10 doc_id sid tid token lemma upos xpos feats tid_source relation

1 1 1 1 北海 北海 PROPN NNP _ 3 nsubj 2 1 1 2 已 已 ADV RB _ 3 advmod 3 1 1 3 成 成 VERB VV _ 0 root 4 1 1 4 为 为 VERB VC _ 3 mark 5 1 1 5 中 中 PROPN NNP _ 6 case:suff 6 1 1 6 国 国 PART SFN _ 7 nsubj 7 1 1 7 对 对 VERB VV _ 12 acl 8 1 1 8 外 外 NOUN NN _ 7 obj 9 1 1 9 开 开 VERB VV _ 12 acl 10 1 1 10 放 放 NOUN NN _ 9 obj 11 1 1 11 中 中 ADP IN _ 10 acl 12 1 1 12 升起 升起 VERB VV _ 16 xcomp 13 1 1 13 的 的 PART DEC _ 12 mark:relcl 14 1 1 14 一 一 NUM CD NumType=Card 15 nummod 15 1 1 15 颗 颗 NOUN NN _ 16 nmod 16 1 1 16 明星 明星 NOUN NN _ 3 obj

This seems tricky to solve, but I have found another person's Python script for achieving Chinese dependency parsing on pre-tokenized text (https://www.cnblogs.com/baiboy/p/nltk1.html).

It would be great if cleanNLP could help solve this particular issue of Chinese processing. Thanks very much!

statsmaths commented 4 years ago

Great question. Apparently there is an option to do this with CoreNLP (see issue 24 on the respective repository). I just pushed an update to cleanNLP that allows configuration options to be passed through directly. If you set tokenize_pretokenized to TRUE, it treats whitespace as the token boundaries.

Here's the way it works without passing the option (still splits 中国 into two tokens):

library(cleanNLP)

x_tok <- "北海 已 成 为 中国 对外 开放 中 升起 的 一 颗 明星 。"

cnlp_init_corenlp("zh")
obj <- cnlp_annotate(x_tok)
obj$token # old token
   doc_id sid tid token lemma  upos xpos        feats tid_source   relation
1       1   1   1  北海  北海 PROPN  NNP            _          3      nsubj
2       1   1   2    已    已   ADV   RB            _          3     advmod
3       1   1   3    成    成  VERB   VV            _          0       root
4       1   1   4    为    为  VERB   VC            _          3       mark
5       1   1   5    中    中 PROPN  NNP            _          6  case:suff
6       1   1   6    国    国  PART  SFN            _         10      nsubj
7       1   1   7    对    对  VERB   VV            _         10        acl
8       1   1   8  外开  外开  NOUN   NN            _          9      nsubj
9       1   1   9  放中  放中  VERB   VV            _         10        acl
10      1   1  10  升起  升起  VERB   VV            _         14  acl:relcl
11      1   1  11    的    的  PART  DEC            _         10 mark:relcl
12      1   1  12    一    一   NUM   CD NumType=Card         13     nummod
13      1   1  13    颗    颗  NOUN  NNB            _         14        clf
14      1   1  14  明星  明星  NOUN   NN            _          3        obj
15      1   1  15    。    。 PUNCT    .            _          3      punct

And here is what happens if you specify the tokenisation:

cnlp_init_corenlp("zh", config=list("tokenize_pretokenized"=TRUE))
obj <- cnlp_annotate(x_tok)
obj$token
   doc_id sid tid token lemma  upos xpos        feats tid_source   relation
1       1   1   1  北海  北海 PROPN  NNP            _          3      nsubj
2       1   1   2    已    已   ADV   RB            _          3     advmod
3       1   1   3    成    成  VERB   VV            _          0       root
4       1   1   4    为    为  VERB   VC            _          3       mark
5       1   1   5  中国  中国 PROPN  NNP            _          7       nmod
6       1   1   6  对外  对外  NOUN   NN            _          7       nmod
7       1   1   7  开放  开放  NOUN   NN            _          9     advmod
8       1   1   8    中    中   ADP   IN            _          7        acl
9       1   1   9  升起  升起  VERB   VV            _         13  acl:relcl
10      1   1  10    的    的  PART  DEC            _          9 mark:relcl
11      1   1  11    一    一   NUM   CD NumType=Card         12     nummod
12      1   1  12    颗    颗  NOUN  NNB            _         13        clf
13      1   1  13  明星  明星  NOUN   NN            _          3        obj
14      1   1  14    。    。 PUNCT    .            _          3      punct

You can also manually split apart sentences by putting them on their own lines, like this:

x_tok_newline <- "北海 已 成 为 中国 对外\n 开放 中 升起 的 一 颗 明星 。"
obj <- cnlp_annotate(x_tok_newline)
obj$token
   doc_id sid tid token lemma  upos xpos        feats tid_source   relation
1       1   1   1  北海  北海 PROPN  NNP            _          3      nsubj
2       1   1   2    已    已   ADV   RB            _          3     advmod
3       1   1   3    成    成  VERB   VV            _          0       root
4       1   1   4    为    为  VERB   VC            _          3       mark
5       1   1   5  中国  中国  NOUN   NN            _          3        obj
6       1   1   6  对外  对外 PUNCT    .            _          3      punct
7       1   2   1  开放  开放  NOUN   NN            _          3     advmod
8       1   2   2    中    中   ADP   IN            _          1        acl
9       1   2   3  升起  升起  VERB   VV            _          7  acl:relcl
10      1   2   4    的    的  PART  DEC            _          3 mark:relcl
11      1   2   5    一    一   NUM   CD NumType=Card          6     nummod
12      1   2   6    颗    颗  NOUN  NNB            _          7        clf
13      1   2   7  明星  明星  NOUN   NN            _          0       root
14      1   2   8    。    。 PUNCT    .            _          7      punct

Note though that there's always a danger of using a different tokenizer than the one used to train the POS model. Here, for example, the token 对外 is marked as punctuation (PUNCT) because it ends a sentence!

Note that the update affects both the R and Python packages, so you need to update both (R version 3.0.1 and Python version 1.0.1) from source. The Python source is here. Just clone the repository and run pip install . from within it. The R package can be installed with devtools from within R:

devtools::install_github("statsmaths/cleanNLP")
tianfrank commented 4 years ago

I have updated the R package to version 3.0.1 and the Python one to 1.0.1, but I get the following error:

cnlp_init_corenlp("zh", config=list("tokenize_pretokenized"=TRUE)) Error in py_call_impl(callable, dots$args, dots$keywords) : TypeError: init() takes from 1 to 3 positional arguments but 4 were given cnlp_init_corenlp(lang="zh") Error in py_call_impl(callable, dots$args, dots$keywords) : TypeError: init() takes from 1 to 3 positional arguments but 4 were given cnlp_init_corenlp(lang="en") Error in py_call_impl(callable, dots$args, dots$keywords) : TypeError: init() takes from 1 to 3 positional arguments but 4 were given

statsmaths commented 4 years ago

That’s odd. Are you sure that you installed the Python module from the GitHub source and not from PyPI? You’re getting exactly the error I would expect if you installed from the old source; confusingly, both are labelled 1.0.1 at the moment. I will update the GH repo when I can (currently only on my phone).
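If it helps to narrow things down, something like the following should show which Python interpreter and which copy of the module reticulate is actually loading. I'm assuming here that the backing module imports as cleannlp; adjust the name if yours differs:

library(reticulate)

py_config()                        # which Python interpreter reticulate is bound to
py_module_available("cleannlp")    # whether the module can be found at all
mod <- import("cleannlp")
mod$`__file__`                     # path of the installed copy (old source vs. GitHub clone)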

tianfrank commented 4 years ago

@statsmaths Thanks Arnold! It turns out that I had only installed from the setup.py instead of the whole repository (and its version is wrongly labelled 1.0.1, as you said). Now the code works perfectly. Thanks again! I will close the issue now and open new issues should they arise.