vdobrovolskii / wl-coref

This repository contains the code for EMNLP-2021 paper "Word-Level Coreference Resolution"
MIT License

What is the equivalent of "edu.stanford.nlp.trees.EnglishGrammaticalStructure" for the Arabic coreference resolution task? #12

Closed aymen-souid closed 2 years ago

aymen-souid commented 2 years ago

Hi,

I can't find an ArabicGrammaticalStructure class in the Stanford NLP library. The conversion works for English data but not for Arabic:

Converting constituents to dependencies...
development: 0% 0/44 [00:00<?, ?docs/s]
Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for PV+PVSUFF using class edu.stanford.nlp.trees.SemanticHeadFinder in PV+PVSUFF-39
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:222)
    at edu.stanford.nlp.trees.SemanticHeadFinder.determineNonTrivialHead(SemanticHeadFinder.java:348)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:179)
    at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:476)
    at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
    at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
    at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
    at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
    at edu.stanford.nlp.trees.GrammaticalStructure.<init>(GrammaticalStructure.java:94)
    at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:86)
    at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:66)
    at edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams.getGrammaticalStructure(EnglishTreebankParserParams.java:2271)
    at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper$GsIterator.primeGs(GrammaticalStructure.java:1361)
    at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper$GsIterator.<init>(GrammaticalStructure.java:1348)
    at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper.iterator(GrammaticalStructure.java:1325)
    at edu.stanford.nlp.trees.GrammaticalStructure.main(GrammaticalStructure.java:1604)
development: 0% 0/44 [00:00<?, ?docs/s]
Traceback (most recent call last):
  File "convert_to_jsonlines.py", line 392, in <module>
    convert_con_to_dep(args.tmp_dir, conll_filenames)
  File "convert_to_jsonlines.py", line 195, in convert_con_to_dep
    subprocess.run(cmd, check=True, stdout=out)
  File "/home/souid/anaconda3/envs/wl-coref/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['java', '-cp', 'downloads/stanford-parser.jar', 'edu.stanford.nlp.trees.EnglishGrammaticalStructure', '-basic', '-keepPunct', '-conllx', '-treeFile', 'temp/data/conll-2012/v4/data/development/data/arabic/annotations/nw/ann/00/ann_0010.v4_gold_conll']' returned non-zero exit status 1.

vdobrovolskii commented 2 years ago

Sorry, I don't know if Stanford Parser supports Arabic, so I might not be of great help here.

Maybe this can help: https://nlp.stanford.edu/software/parser-arabic-faq.html

vdobrovolskii commented 2 years ago

UPD: On second thought, dependency trees are only necessary to find span heads for training. If you can modify the code so that it uses the constituency syntax to find span heads, you should be good to go.

aymen-souid commented 2 years ago

Thanks a lot for your reply. So, as I understand it, I should modify the Stanford NLP Java code that parses the data?

vdobrovolskii commented 2 years ago

I think it might be easier to disable conversion to dependencies in convert_to_jsonlines.py and use the constituency data to find span heads in convert_to_heads.py.

aymen-souid commented 2 years ago

The execution of convert_to_jsonlines.py depends on many .dep files. Are these generated by the Stanford Parser? I'm trying to skip the conversion to dependencies, but the code keeps raising errors like this: No such file or directory: 'temp/data/conll2012/v4/data/development/data/arabic/annotations/nw/ann/00/ann_0010.v4_gold_conll_dep'

aymen-souid commented 2 years ago

Do you think the traditional way of converting the data to jsonlines would work for this project? Like this one: https://github.com/kentonl/e2e-coref/blob/master/minimize.py

vdobrovolskii commented 2 years ago

Because you are not going to be using those *dep files, you should disable the parts related to reading/writing those files and to saving ["head"], ["pos"] and ["deprel"] keys in the output. Those key-value pairs are not needed for training, they are only needed to obtain span heads (convert_to_heads.py). You might need to write your own convert_to_heads.
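As a rough illustration of dropping those three keys, here is a minimal sketch. It assumes each document is a plain dict written as one JSON line; the helper name strip_dependency_keys and the sample fields other than "head", "pos" and "deprel" are hypothetical and not taken from convert_to_jsonlines.py.

```python
import json

# Keys that only exist to support dependency-based head finding.
DEPENDENCY_KEYS = ("head", "pos", "deprel")


def strip_dependency_keys(doc: dict) -> dict:
    """Return a copy of the document without the dependency-related keys."""
    return {key: value for key, value in doc.items() if key not in DEPENDENCY_KEYS}


# Hypothetical document; the real field set may differ.
doc = {
    "document_id": "example_doc",
    "cased_words": ["the", "first", "man"],
    "head": [2, 2, None],
    "pos": ["DT", "JJ", "NN"],
    "deprel": ["det", "amod", "root"],
}

# One JSON object per line, as in a jsonlines file.
line = json.dumps(strip_dependency_keys(doc), ensure_ascii=False)
print(line)
```

In practice the same effect would come from simply not assigning those keys when the document dict is built, which also removes the need to read the *dep files at all.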

Please refer to this issue (https://github.com/vdobrovolskii/wl-coref/issues/10) for details on what the model expects for training.

And I do not think that https://github.com/kentonl/e2e-coref/blob/master/minimize.py will work.

aymen-souid commented 2 years ago

Okay, I'll try my best then. Thank you so much for your help!

vdobrovolskii commented 2 years ago

good luck! let me know if I can help you with anything

aymen-souid commented 2 years ago

Hello again !

So after creating the jsonlines files, I didn't understand how to extract heads from them. Do you have any idea how to get them? Could you show me how I should convert the data, or what approach I should take to make the extraction possible?

vdobrovolskii commented 2 years ago

For English, the head of a span is defined to be the only word inside the span that depends on a word outside the span. If there is no such word, or more than one, then the rightmost word is chosen to be the span head. I'm not sure about Arabic, as I have little knowledge about its syntax. Maybe you will be good to go with some simple heuristics (like taking the rightmost or the leftmost word). Most probably, however, you will need the constituency data available in OntoNotes. Its description goes like this:

This is the bracketed structure broken before the first open parenthesis in the parse, and the word/part-of-speech leaf replaced with a *. The full parse can be created by substituting the asterix with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column.
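The substitution described in that quote can be sketched roughly as follows. This is only an illustration: the function name rebuild_parse is hypothetical, and the sample rows are hand-written (word, part-of-speech, parse bit) triples rather than lines read from an actual OntoNotes file.

```python
def rebuild_parse(rows):
    """Rebuild a bracketed parse from (word, pos, parse_bit) rows of one sentence.

    Each parse bit contains a single "*" that stands for the "(pos word)" leaf.
    """
    pieces = []
    for word, pos, parse_bit in rows:
        # Replace only the first asterisk, which marks the leaf position.
        pieces.append(parse_bit.replace("*", f"({pos} {word})", 1))
    return "".join(pieces)


# Hand-written rows for illustration only.
rows = [
    ("the", "DT", "(NP(NP*"),
    ("first", "JJ", "*"),
    ("man", "NN", "*)"),
    ("of", "IN", "(PP*"),
    ("Enterprise", "NNP", "*))"),
]

print(rebuild_parse(rows))
```

Concatenating the substituted parse bits row by row gives one bracketed string per sentence.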

So you can obtain the constituency tree for each sentence and then use it to find span heads. For instance, in the following bit: (NP(NP (DT the)(JJ first)(NN man))(PP (IN of) (NNP Enterprise))) you see that the outer NP consists of (NP (NP *)(PP *)). You then search for the head in the inner NP (the first man), which for NPs is a noun (here: man).
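A minimal sketch of that search is given below. It is not code from the repository: Node, parse_tree and find_np_head are hypothetical names, and the rule itself (descend into the first NP child, otherwise take the rightmost child whose tag starts with "NN") is a simplifying assumption based on the English Penn Treebank tags. The tag set and the search direction would most likely need adapting for Arabic.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None   # set only for (pos word) leaves
    index: Optional[int] = None  # 0-based token position in the sentence


def parse_tree(text: str) -> Node:
    """Parse a bracketed string such as "(NP (DT the)(NN man))" into Nodes."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0
    n_words = 0

    def read() -> Node:
        nonlocal position, n_words
        position += 1  # skip "("
        node = Node(label=tokens[position])
        position += 1
        while tokens[position] != ")":
            if tokens[position] == "(":
                node.children.append(read())
            else:
                node.word, node.index = tokens[position], n_words
                n_words += 1
                position += 1
        position += 1  # skip ")"
        return node

    return read()


def find_np_head(node: Node) -> Node:
    """Assumed heuristic: first NP child, else rightmost noun, else rightmost child."""
    if node.word is not None:
        return node
    for child in node.children:
        # Ignore function tags such as NP-SBJ when comparing labels.
        if child.label.split("-")[0] == "NP":
            return find_np_head(child)
    for child in reversed(node.children):
        if child.label.startswith("NN"):
            return find_np_head(child)
    return find_np_head(node.children[-1])


tree = parse_tree("(NP(NP (DT the)(JJ first)(NN man))(PP (IN of) (NNP Enterprise)))")
head = find_np_head(tree)
print(head.word, head.index)
```

The returned node carries the token index, which is what a span-head file would need rather than the word itself.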


The output of EnglishGrammaticalStructure is close to the CoNLL-U format (https://universaldependencies.org/format.html). For finding span heads, one is mostly interested in the HEAD column.
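For reference, the dependency-based rule described earlier in this comment (the only word in the span whose head lies outside the span, otherwise the rightmost word) could be sketched like this. The function names are hypothetical, the HEAD values in the example are hand-written, and treating the sentence root as "depending on something outside the span" is an assumption of this sketch.

```python
from typing import List, Optional


def conll_heads_to_zero_based(head_column: List[int]) -> List[Optional[int]]:
    """Convert a 1-based HEAD column (0 = root) to 0-based indices (None = root)."""
    return [None if head == 0 else head - 1 for head in head_column]


def find_span_head(heads: List[Optional[int]], start: int, end: int) -> int:
    """Return the head index of the span [start, end).

    A word is a candidate if its head is outside the span (or it is the root).
    With exactly one candidate that word is the head; otherwise fall back to
    the rightmost word of the span.
    """
    candidates = [
        i for i in range(start, end)
        if heads[i] is None or not (start <= heads[i] < end)
    ]
    if len(candidates) == 1:
        return candidates[0]
    return end - 1


# "the first man of Enterprise" with hand-written HEAD values.
heads = conll_heads_to_zero_based([3, 3, 0, 3, 4])

print(find_span_head(heads, 0, 5))  # whole phrase
print(find_span_head(heads, 3, 5))  # "of Enterprise"
print(find_span_head(heads, 0, 2))  # "the first": two candidates, rightmost wins
```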