Closed tianfrank closed 4 years ago
That is odd. Is it possible for you to either make part of the file available or create an example from a string? It's hard to debug without a working example.
@statsmaths
Thanks for the reply. Here is the code; the Chinese sentence is "我爱中国", which means "I love China". Even when the sentence is encoded as UTF-8, the problem persists.
x <- "我爱中国。"
library(rJava)
library(cleanNLP)
cnlp_init_corenlp("zh", anno_level = 2)
obj <- cnlp_annotate(x, as_strings = TRUE)
head(cnlp_get_token(obj))
# A tibble: 1 x 8
  id sid tid word lemma upos pos cid
Thanks for the example. This is a good demonstration of a long-standing problem I had with the way the CoreNLP Java library was written, and why I very recently re-wrote the package entirely (see version 3.0.0, now on CRAN). Apologies for the delay; it took a while to find the time to address all of the underlying issues.
Using the new version, I get the following after parsing the text:
x<-"我爱中国。"
library(cleanNLP)
cnlp_init_corenlp("zh")
obj <- cnlp_annotate(x)
obj$token
doc_id sid tid token lemma upos xpos feats tid_source relation
1 1 1 1 我 我 PRON PRP Person=1 2 nsubj
2 1 1 2 爱 爱 VERB VV _ 0 root
3 1 1 3 中国 中国 VERB VV _ 2 xcomp
4 1 1 4 。 。 PUNCT . _ 2 punct
I do not know Chinese at all, but looking up the three tokens (我, 爱, and 中国) in the Collins Chinese<=>English dictionary seemed to indicate that these correspond to the words for "me", "love", and "China".
Please let me know if that fixes the problem and if there are any other issues that come up with the new interface!
@statsmaths Thanks for updating the package. I have tried the original example and it now works exactly the same on my computer.
But here comes another question about parsing Chinese. In Chinese there are no spaces between words, so when doing dependency parsing the parser first tokenizes the text. However, the tokenization is not always what we want (although there is some disagreement about how Chinese words should be segmented). It would be great if I could tokenize the sentence with a more accurate segmentation program (for example, Jieba: https://github.com/fxsjy/jieba), check the segmentation myself, and then run the dependency parser on that segmentation. In other words, is it possible to run Chinese dependency parsing on already-tokenized text with spaces between the tokens?
Here is an example showing the tokenization problem of the Chinese module of stanford parser:
x <- "北海已成为中国对外开放中升起的一颗明星。"
obj <- cnlp_annotate(x)
Processed document 1 of 1
obj$token
# A tibble: 15 x 10
   doc_id sid tid token lemma upos xpos feats tid_source relation
1  1 1 1 北海 北海 PROPN NNP _ 3 nsubj
2  1 1 2 已 已 ADV RB _ 3 advmod
3  1 1 3 成 成 VERB VV _ 0 root
4  1 1 4 为 为 VERB VC _ 3 mark
5  1 1 5 中 中 PROPN NNP _ 6 case:suff
6  1 1 6 国 国 PART SFN _ 8 nmod
7  1 1 7 对外 对外 NOUN NN _ 8 nmod
8  1 1 8 开放 开放 NOUN NN _ 10 advmod
9  1 1 9 中 中 ADP IN _ 8 acl
10 1 1 10 升起 升起 VERB VV _ 14 acl:relcl
11 1 1 11 的 的 PART DEC _ 10 mark:relcl
12 1 1 12 一 一 NUM CD NumType=Card 13 nummod
13 1 1 13 颗 颗 NOUN NNB _ 14 clf
14 1 1 14 明星 明星 NOUN NN _ 3 obj
15 1 1 15 。 。 PUNCT . _ 3 punct
An obvious problem with the result is that it separates the word 中国 (China) into two tokens (中 and 国).
Even if I try to pre-tokenize the sentence with spaces between words, the problem remains, with a somewhat different tokenization of the sentence:
x <- "北海 已 成为 中国 对外开放 中 升起 的 一 颗 明星"
obj <- cnlp_annotate(x)
Processed document 1 of 1
obj$token
# A tibble: 16 x 10
   doc_id sid tid token lemma upos xpos feats tid_source relation
1  1 1 1 北海 北海 PROPN NNP _ 3 nsubj
2  1 1 2 已 已 ADV RB _ 3 advmod
3  1 1 3 成 成 VERB VV _ 0 root
4  1 1 4 为 为 VERB VC _ 3 mark
5  1 1 5 中 中 PROPN NNP _ 6 case:suff
6  1 1 6 国 国 PART SFN _ 7 nsubj
7  1 1 7 对 对 VERB VV _ 12 acl
8  1 1 8 外 外 NOUN NN _ 7 obj
9  1 1 9 开 开 VERB VV _ 12 acl
10 1 1 10 放 放 NOUN NN _ 9 obj
11 1 1 11 中 中 ADP IN _ 10 acl
12 1 1 12 升起 升起 VERB VV _ 16 xcomp
13 1 1 13 的 的 PART DEC _ 12 mark:relcl
14 1 1 14 一 一 NUM CD NumType=Card 15 nummod
15 1 1 15 颗 颗 NOUN NN _ 16 nmod
16 1 1 16 明星 明星 NOUN NN _ 3 obj
It would be tricky to solve this problem; I have found another person's Python script for achieving Chinese dependency parsing with pre-tokenized texts (https://www.cnblogs.com/baiboy/p/nltk1.html).
It would be great if cleanNLP could help solve this particular issue of Chinese processing. Thanks very much!
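For the pre-tokenization step itself, one possibility is to segment the text in R first and re-join the tokens with spaces before handing the string to the parser. A minimal sketch, assuming the jiebaR package (an R wrapper for Jieba) is installed:

```r
# Sketch: segment with jiebaR, then re-join the tokens with spaces
# so a whitespace-aware parser can reuse this segmentation.
library(jiebaR)

seg <- worker()                        # default Jieba segmenter
x <- "北海已成为中国对外开放中升起的一颗明星。"
tokens <- segment(x, seg)              # character vector of tokens
x_tok <- paste(tokens, collapse = " ") # space-separated string
```

The resulting `x_tok` (after any manual corrections to `tokens`) could then be passed to a parser that respects whitespace tokenization.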
Great question. Apparently there is an option to do this with CoreNLP (see issue 24 on the respective repository).
I just pushed an update to cleanNLP that allows for passing configuration options directly. If you set tokenize_pretokenized to TRUE, it assumes that white space indicates token boundaries.
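To make the convention concrete, here is a base-R illustration (not cleanNLP's internal code) of the splitting rule that tokenize_pretokenized implies: tokens are taken verbatim from whitespace splits, with no resegmentation by the model.

```r
# Base-R illustration of whitespace pretokenization: each
# whitespace-separated chunk is treated as exactly one token.
x_tok <- "北海 已 成 为 中国 对外 开放 中 升起 的 一 颗 明星 。"
tokens <- strsplit(x_tok, "\\s+")[[1]]
length(tokens)  # 14 tokens
tokens[5]       # "中国" stays a single token
```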
Here's the way it works without passing the option (still splits 中国 into two tokens):
library(cleanNLP)
x_tok <- "北海 已 成 为 中国 对外 开放 中 升起 的 一 颗 明星 。"
cnlp_init_corenlp("zh")
obj <- cnlp_annotate(x_tok)
obj$token
doc_id sid tid token lemma upos xpos feats tid_source relation
1 1 1 1 北海 北海 PROPN NNP _ 3 nsubj
2 1 1 2 已 已 ADV RB _ 3 advmod
3 1 1 3 成 成 VERB VV _ 0 root
4 1 1 4 为 为 VERB VC _ 3 mark
5 1 1 5 中 中 PROPN NNP _ 6 case:suff
6 1 1 6 国 国 PART SFN _ 10 nsubj
7 1 1 7 对 对 VERB VV _ 10 acl
8 1 1 8 外开 外开 NOUN NN _ 9 nsubj
9 1 1 9 放中 放中 VERB VV _ 10 acl
10 1 1 10 升起 升起 VERB VV _ 14 acl:relcl
11 1 1 11 的 的 PART DEC _ 10 mark:relcl
12 1 1 12 一 一 NUM CD NumType=Card 13 nummod
13 1 1 13 颗 颗 NOUN NNB _ 14 clf
14 1 1 14 明星 明星 NOUN NN _ 3 obj
15 1 1 15 。 。 PUNCT . _ 3 punct
And here is what happens if you specify the tokenization:
cnlp_init_corenlp("zh", config=list("tokenize_pretokenized"=TRUE))
obj <- cnlp_annotate(x_tok)
obj$token
doc_id sid tid token lemma upos xpos feats tid_source relation
1 1 1 1 北海 北海 PROPN NNP _ 3 nsubj
2 1 1 2 已 已 ADV RB _ 3 advmod
3 1 1 3 成 成 VERB VV _ 0 root
4 1 1 4 为 为 VERB VC _ 3 mark
5 1 1 5 中国 中国 PROPN NNP _ 7 nmod
6 1 1 6 对外 对外 NOUN NN _ 7 nmod
7 1 1 7 开放 开放 NOUN NN _ 9 advmod
8 1 1 8 中 中 ADP IN _ 7 acl
9 1 1 9 升起 升起 VERB VV _ 13 acl:relcl
10 1 1 10 的 的 PART DEC _ 9 mark:relcl
11 1 1 11 一 一 NUM CD NumType=Card 12 nummod
12 1 1 12 颗 颗 NOUN NNB _ 13 clf
13 1 1 13 明星 明星 NOUN NN _ 3 obj
14 1 1 14 。 。 PUNCT . _ 3 punct
You can also manually split apart sentences by putting them on their own lines, like this:
x_tok_newline <- "北海 已 成 为 中国 对外\n 开放 中 升起 的 一 颗 明星 。"
obj <- cnlp_annotate(x_tok_newline)
obj$token
doc_id sid tid token lemma upos xpos feats tid_source relation
1 1 1 1 北海 北海 PROPN NNP _ 3 nsubj
2 1 1 2 已 已 ADV RB _ 3 advmod
3 1 1 3 成 成 VERB VV _ 0 root
4 1 1 4 为 为 VERB VC _ 3 mark
5 1 1 5 中国 中国 NOUN NN _ 3 obj
6 1 1 6 对外 对外 PUNCT . _ 3 punct
7 1 2 1 开放 开放 NOUN NN _ 3 advmod
8 1 2 2 中 中 ADP IN _ 1 acl
9 1 2 3 升起 升起 VERB VV _ 7 acl:relcl
10 1 2 4 的 的 PART DEC _ 3 mark:relcl
11 1 2 5 一 一 NUM CD NumType=Card 6 nummod
12 1 2 6 颗 颗 NOUN NNB _ 7 clf
13 1 2 7 明星 明星 NOUN NN _ 0 root
14 1 2 8 。 。 PUNCT . _ 7 punct
Note, though, that there is always a danger in using a different tokenizer than the one used to train the POS model. Here, for example, the token 对外 is tagged as punctuation (PUNCT) because it ends a sentence!
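The two-level convention at work here (newlines separate sentences, spaces separate tokens) can be sketched in base R; this is only an illustration of the input format, not cleanNLP's internal code:

```r
# Base-R sketch of the two-level splitting convention used with
# pretokenized input: newline -> sentence, whitespace -> token.
x <- "北海 已 成 为 中国 对外\n开放 中 升起 的 一 颗 明星 。"
sentences <- strsplit(x, "\n")[[1]]
tokens <- lapply(sentences, function(s) strsplit(trimws(s), "\\s+")[[1]])
lengths(tokens)  # 6 tokens in sentence 1, 8 in sentence 2
```

This matches the output above, where sid 1 covers six tokens (ending at 对外) and sid 2 covers the remaining eight.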
Note that the update affects both the R and Python packages, so you need to update
both (R version 3.0.1 and Python version 1.0.1) from source. The Python source is
here. Just clone and run
pip install .
from within the repository. The R package can be installed by running
devtools from within R:
devtools::install_github("statsmaths/cleanNLP")
I have updated the R package to version 3.0.1 and the Python one to 1.0.1, but I got the following error:
cnlp_init_corenlp("zh", config=list("tokenize_pretokenized"=TRUE))
Error in py_call_impl(callable, dots$args, dots$keywords) :
  TypeError: __init__() takes from 1 to 3 positional arguments but 4 were given
cnlp_init_corenlp(lang="zh")
Error in py_call_impl(callable, dots$args, dots$keywords) :
  TypeError: __init__() takes from 1 to 3 positional arguments but 4 were given
cnlp_init_corenlp(lang="en")
Error in py_call_impl(callable, dots$args, dots$keywords) :
  TypeError: __init__() takes from 1 to 3 positional arguments but 4 were given
That’s odd. Are you sure that you installed the Python module from the GitHub source and not from PyPI? You’re getting exactly the error I would expect if you had installed from the old source... confusingly, both are labelled 1.0.1 at the moment. I will update the GH repo when I can (currently only on my phone).
@statsmaths Thanks Arnold! It turns out that I had only installed setup.py instead of the whole repository (and the version is wrongly labelled 1.0.1, as you said). Now the code works perfectly. Thanks again! I will close the issue now and open new issues should they arise.
I'm trying to parse a Chinese sentence using the Stanford parser, but the results are unrecognizable Chinese characters. When I tried storing the Chinese sentence in a txt file and parsing the file, the results were still unrecognizable characters. I'm using RStudio and the default text encoding is UTF-8. Below are the code and results:
Many thanks!