Sorry, I don't know if Stanford Parser supports Arabic, so I might not be of great help here.
Maybe this can help: https://nlp.stanford.edu/software/parser-arabic-faq.html
UPD: On second thought, dependency trees are only necessary to find span heads for training. If you can modify the code so that it uses the constituency syntax to find span heads, you should be good to go.
Thanks a lot for your reply! So, as I understand it, I should modify the Java code from nlp.stanford that parses the data?
I think it might be easier to disable conversion to dependencies in convert_to_jsonlines.py and use the constituency data to find span heads in convert_to_heads.py.
The execution of convert_to_jsonlines.py depends on many .dep files. Are these generated by the Stanford parser? I'm trying to avoid the conversion to dependencies, but the code keeps generating errors like this: No such file or directory: 'temp/data/conll2012/v4/data/development/data/arabic/annotations/nw/ann/00/ann_0010.v4_gold_conll_dep'
Do you think that using the traditional method to convert the data to jsonlines would work for this project? Like this one: https://github.com/kentonl/e2e-coref/blob/master/minimize.py
Because you are not going to be using those *_dep files, you should disable the parts related to reading/writing those files and to saving the "head", "pos" and "deprel" keys in the output. Those key-value pairs are not needed for training; they are only needed to obtain span heads (convert_to_heads.py). You might need to write your own convert_to_heads.
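If it helps, the shape of the change I have in mind is roughly this (not the actual convert_to_jsonlines.py code, just a sketch; finalize_doc and dep_info are hypothetical names, only the "head", "pos" and "deprel" keys are real):

def finalize_doc(doc, dep_info=None):
    """Attach the dependency columns only when a parsed *_dep file was available."""
    # For Arabic there would be no *_dep file, so dep_info stays None and the
    # three keys are simply never written to the jsonlines output.
    if dep_info is not None:
        doc["head"] = dep_info["head"]
        doc["pos"] = dep_info["pos"]
        doc["deprel"] = dep_info["deprel"]
    return doc

print(finalize_doc({"document_id": "nw/ann/00/ann_0010"}))  # no dep keys added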
Please refer to this issue (https://github.com/vdobrovolskii/wl-coref/issues/10) for details on what the model expects for training.
And I do not think that https://github.com/kentonl/e2e-coref/blob/master/minimize.py will work.
Okay, I'll try my best then. Thank you so much for your help!
good luck! let me know if I can help you with anything
Hello again!
So, after creating the jsonlines files, I didn't understand how to extract heads from them. Do you have any idea how to get them? Can you show me how I should convert the data, or any path I should take into consideration to make the extraction possible?
For English, the head of a span is defined to be the only word inside the span that depends on a word outside the span. If there is no such word, or more than one, then the rightmost word is chosen to be the span head. I'm not sure about Arabic, as I have little knowledge of its syntax. Maybe you will be good to go with some simple heuristics (like taking the rightmost or the leftmost word). Most probably, however, you will need the constituency data available in OntoNotes. Its description goes like this:
This is the bracketed structure broken before the first open parenthesis in the parse, and the word/part-of-speech leaf replaced with a *. The full parse can be created by substituting the asterisk with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column.
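As a quick sketch of that substitution (parse_bits, pos_tags and words below stand for the per-token values of the parse, POS and word columns; this is not code from the repository):

def rebuild_parse(parse_bits, pos_tags, words):
    # Replace each "*" with "(POS word)" and concatenate the pieces.
    return "".join(
        bit.replace("*", f"({pos} {word})")
        for bit, pos, word in zip(parse_bits, pos_tags, words)
    )

print(rebuild_parse(
    ["(NP(NP*", "*", "*)", "(PP*", "*))"],
    ["DT", "JJ", "NN", "IN", "NNP"],
    ["the", "first", "man", "of", "Enterprise"],
))
# (NP(NP(DT the)(JJ first)(NN man))(PP(IN of)(NNP Enterprise)))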
So you can obtain the constituency tree for each sentence and then use it to find span heads. For instance, in the following bit:
(NP(NP (DT the)(JJ first)(NN man))(PP (IN of) (NNP Enterprise)))
you see that the outer NP consists of (NP (NP *)(PP *)). You then search for the head in the inner NP (the first man), which for NPs is a noun (here: man).
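If it helps, here is a toy version of that search using nltk (the head rule here, descend into the inner NP and otherwise take the rightmost nominal, is a deliberate simplification, and the labels would need to be adapted to the Arabic treebank):

from nltk import Tree

def find_head(tree):
    """Return the (pos, word) leaf chosen as the head of the constituent."""
    if isinstance(tree[0], str):               # preterminal such as (NN man)
        return tree.label(), tree[0]
    if tree.label().startswith("NP"):
        for child in tree:                     # search the inner NP first
            if child.label().startswith("NP"):
                return find_head(child)
        for child in reversed(tree):           # otherwise the rightmost noun
            if child.label().startswith("NN"):
                return find_head(child)
    return find_head(tree[-1])                 # naive default: last child

tree = Tree.fromstring(
    "(NP (NP (DT the) (JJ first) (NN man)) (PP (IN of) (NNP Enterprise)))")
print(find_head(tree))  # ('NN', 'man')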
The output of EnglishGrammaticalStructure is close to this: https://universaldependencies.org/format.html. For finding span heads, one is mostly interested in the HEAD column.
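To give an idea, the heuristic for English described above can be written roughly like this (assuming 0-based word indices and a heads list built from the HEAD column, with None marking the root; the example heads are made up):

def span_head(heads, start, end):
    """Head of the span [start, end): the single word whose head lies outside it."""
    outside = [i for i in range(start, end)
               if heads[i] is None or not (start <= heads[i] < end)]
    # No such word, or more than one of them: fall back to the rightmost word.
    return outside[0] if len(outside) == 1 else end - 1

# "the first man of Enterprise": the->man, first->man, man->root,
# of->Enterprise, Enterprise->man (a made-up parse)
heads = [2, 2, None, 4, 2]
print(span_head(heads, 0, 5))  # 2, i.e. "man"
print(span_head(heads, 3, 5))  # 4, i.e. "Enterprise"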
Hi,
I can't find the ArabicGrammaticalStructure class in nlp.stanford. The pipeline works for English data but not for Arabic. Here is the error I get:
Converting constituents to dependencies...
development: 0% 0/44 [00:00<?, ?docs/s]
Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for PV+PVSUFF using class edu.stanford.nlp.trees.SemanticHeadFinder in PV+PVSUFF-39
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:222)
    at edu.stanford.nlp.trees.SemanticHeadFinder.determineNonTrivialHead(SemanticHeadFinder.java:348)
    at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:179)
    at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:476)
    at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
    at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
    at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
    at edu.stanford.nlp.trees.TreeGraphNode.percolateHeads(TreeGraphNode.java:474)
    at edu.stanford.nlp.trees.GrammaticalStructure.<init>(GrammaticalStructure.java:94)
    at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:86)
    at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:66)
    at edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams.getGrammaticalStructure(EnglishTreebankParserParams.java:2271)
    at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper$GsIterator.primeGs(GrammaticalStructure.java:1361)
    at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper$GsIterator.<init>(GrammaticalStructure.java:1348)
    at edu.stanford.nlp.trees.GrammaticalStructure$TreeBankGrammaticalStructureWrapper.iterator(GrammaticalStructure.java:1325)
    at edu.stanford.nlp.trees.GrammaticalStructure.main(GrammaticalStructure.java:1604)
development: 0% 0/44 [00:00<?, ?docs/s]
Traceback (most recent call last):
File "convert_to_jsonlines.py", line 392, in
convert_con_to_dep(args.tmp_dir, conll_filenames)
File "convert_to_jsonlines.py", line 195, in convert_con_to_dep
subprocess.run(cmd, check=True, stdout=out)
File "/home/souid/anaconda3/envs/wl-coref/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['java', '-cp', 'downloads/stanford-parser.jar', 'edu.stanford.nlp.trees.EnglishGrammaticalStructure', '-basic', '-keepPunct', '-conllx', '-treeFile', 'temp/data/conll-2012/v4/data/development/data/arabic/annotations/nw/ann/00/ann_0010.v4_gold_conll']' returned non-zero exit status 1.