
vdobrovolskii / wl-coref

This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"

MIT License

about chinese dataset #20

Closed leileilin closed 2 years ago

leileilin commented 2 years ago

Hello, thank you for open-sourcing this great work. I want to process the Chinese dataset following your pipeline, but the convert_to_jsonlines.py step reports an error. Do you know why? Thanks.

vdobrovolskii commented 2 years ago

Hi!

Could you please post the error stack trace?

leileilin commented 2 years ago

Here it is: subprocess.CalledProcessError: Command '['java', '-cp', 'downloads/stanford-parser.jar', 'edu.stanford.nlp.trees.EnglishGrammaticalStructure', '-basic', '-keepPunct', '-conllx', '-treeFile', 'temp/data/conll-2012/v4/data/development/data/chinese/annotations/bc/cctv/00/cctv_0000.v4_gold_conll']' returned non-zero exit status 1.

leileilin commented 2 years ago

I think this is caused by not choosing a Chinese parser, but I don't know where to start.

vdobrovolskii commented 2 years ago

Have you tried manually running java -cp downloads/stanford-parser.jar edu.stanford.nlp.trees.EnglishGrammaticalStructure -basic -keepPunct -conllx -treeFile temp/data/conll-2012/v4/data/development/data/chinese/annotations/bc/cctv/00/cctv_0000.v4_gold_conll? Also, note that the command passes EnglishGrammaticalStructure as the Java class, while for Chinese you might need something like ChineseGrammaticalStructure (check the docs to be sure).
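Since subprocess.CalledProcessError only reports the exit status, the Java-side message is easy to miss. A minimal Python sketch for reproducing the call and reading the converter's own stderr could look like the following. The helper name run_and_report and the constant CMD are hypothetical and not part of the repository; the command itself is copied from the stack trace above.

```python
import subprocess
from typing import List, Tuple

# Command copied from the stack trace above; CMD is a hypothetical name.
CMD: List[str] = [
    "java", "-cp", "downloads/stanford-parser.jar",
    "edu.stanford.nlp.trees.EnglishGrammaticalStructure",
    "-basic", "-keepPunct", "-conllx", "-treeFile",
    "temp/data/conll-2012/v4/data/development/data/chinese/"
    "annotations/bc/cctv/00/cctv_0000.v4_gold_conll",
]


def run_and_report(cmd: List[str]) -> Tuple[int, str]:
    """Run cmd and return (exit status, captured stderr) without raising."""
    # check=False keeps a non-zero exit status from raising
    # CalledProcessError, so the captured stderr can be inspected.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.returncode, result.stderr
```

Calling run_and_report(CMD) from the repository root and printing the second element of the result should show the actual Java exception rather than only "returned non-zero exit status 1".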

leileilin commented 2 years ago

This is the point where I am confused. I changed the class to edu.stanford.nlp.trees.GrammaticalStructure, but I still get the following error: Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for DNP using class edu.stanford.nlp.trees.SemanticHeadFinder in DNP-27

vdobrovolskii commented 2 years ago

I am not sure how to do it with Chinese, but have a look here, it might help: https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/international/pennchinese/ChineseGrammaticalStructure.html
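One way to wire this into a conversion script is to pick the converter class per language. The sketch below is only an illustration under assumptions: the Chinese class name is taken from the Javadoc URL above, the mapping CONVERTER_CLASSES and the function build_converter_command are hypothetical names, and it has not been verified here that ChineseGrammaticalStructure accepts exactly the same flags as the English class.

```python
from typing import Dict, List

# The English class name comes from the stack trace in this thread; the
# Chinese class name comes from the linked Javadoc page.
CONVERTER_CLASSES: Dict[str, str] = {
    "english": "edu.stanford.nlp.trees.EnglishGrammaticalStructure",
    "chinese": (
        "edu.stanford.nlp.trees.international.pennchinese."
        "ChineseGrammaticalStructure"
    ),
}


def build_converter_command(
    language: str,
    tree_file: str,
    jar: str = "downloads/stanford-parser.jar",
) -> List[str]:
    """Build the java command line for the given language and tree file."""
    try:
        converter = CONVERTER_CLASSES[language]
    except KeyError:
        raise ValueError(f"no converter class known for {language!r}") from None
    # Assumption: the flags used for English are reused unchanged.
    return [
        "java", "-cp", jar, converter,
        "-basic", "-keepPunct", "-conllx", "-treeFile", tree_file,
    ]
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with long paths.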

leileilin commented 2 years ago

Thank you for your answer. With that class the conversion does run, but messages like the following appear: Correcting error: treebank tree is not phrasal; wrapping in FRAG: (PU －－)

leileilin commented 2 years ago

Another problem is that the documentation does not describe what these parameters (-basic, -keepPunct, -conllx, -treeFile) do. Where did you learn about them?

vdobrovolskii commented 2 years ago

Does the message "Correcting error: treebank tree is not phrasal; wrapping in FRAG: (PU －－)" occur on all the documents?

leileilin commented 2 years ago

Only on some sentences in the document.

vdobrovolskii commented 2 years ago

Hm. If it's just a couple of sentences, why don't you ignore this error and see if everything else works?

leileilin commented 2 years ago

The parses of those sentences are wrong, so I simply discard them.

leileilin commented 2 years ago

In fact, this approach has a drawback: discarding those sentences destroys the integrity of the data.

vdobrovolskii commented 2 years ago

It really comes down to the percentage of such sentences. What is it?
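To put a number on that, one could count the warnings in the converter's stderr and compare them with the number of sentences that were converted. This is a rough sketch that assumes the converter prints one "Correcting error: treebank tree is not phrasal" line per affected tree (an assumption based on the message quoted above) and that the total sentence count is known from elsewhere. The names count_wrapped_trees and affected_percentage are hypothetical.

```python
WARNING_PREFIX = "Correcting error: treebank tree is not phrasal"


def count_wrapped_trees(stderr_text: str) -> int:
    """Count stderr lines that report a tree being wrapped in FRAG."""
    return sum(
        1
        for line in stderr_text.splitlines()
        if line.strip().startswith(WARNING_PREFIX)
    )


def affected_percentage(stderr_text: str, total_sentences: int) -> float:
    """Return the share of sentences with a warning, in percent."""
    if total_sentences <= 0:
        raise ValueError("total_sentences must be positive")
    return 100.0 * count_wrapped_trees(stderr_text) / total_sentences
```

If the percentage turns out to be very small, dropping or ignoring those sentences may be acceptable; if it is large, the wrapped trees probably need a closer look.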