mahdiman opened this issue 4 years ago (status: Open)
This would certainly be a useful feature to have. @AngledLuffa I am not familiar with the enhanced dependency implementation in CoreNLP - how difficult do you think this is?
I'm not very familiar with the dependency parser implementation in stanza. Does it allow multiple connections for a dependent? If not, we would need to write a dependency converter or reuse the CoreNLP converter. Reusing sounds like the better option.
@AngledLuffa It could be adapted to allow multiple connections per dependent, but it doesn't support that out of the box.
It's a question of whether we would rather use the Java server for accessing the converter or allow multiple connections per dependent. I think reimplementing it in Python would be the worst possible solution.
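For context, the distinction being discussed here can be sketched in plain Python: a basic UD tree gives every word exactly one head, while an enhanced graph may attach a word to several heads at once. The sentence, indices, and edge tuples below are made up purely for illustration:

```python
# Toy sentence: "the(1) store(2) buys(3) and(4) sells(5) cameras(6)".
# Edges are (head_index, relation, dependent_index) triples.
basic = [(5, "obj", 6)]                     # cameras depends only on "sells"
enhanced = [(5, "obj", 6), (3, "obj", 6)]   # ...and is also propagated to "buys"

# Group the enhanced edges by dependent: word 6 ends up with two heads,
# which is exactly what a tree-shaped depparse output cannot represent.
heads_of = {}
for head, rel, dep in enhanced:
    heads_of.setdefault(dep, []).append((head, rel))

assert heads_of[6] == [(5, "obj"), (3, "obj")]
```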
Why is that? Is it due to the complexity of the task itself? Complexity aside, I feel that a native Python implementation would be better integrated with the neural pipeline, since users would never need to leave the Stanza Python environment. Ideally we could have it as a processor that takes the depparse output and adds some new annotations to the document.
I was thinking in terms of having to repeat all of the logic involved in the java version
If the formalism changes or we come up with an improvement, we'd need to remember to redo it on both sides
A more generalizable way of doing it would be to write the conversion as a sequence of rules which could be applied in both Java & Python, I suppose.
This is a great point. Maintenance could be an issue going forward. Does the CoreNLP converter support external dependency annotations? If so, in what format?
Unfortunately, no, which is part of why the easiest solution by far would be to leverage the existing converter
Looks like this is what we want? https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/trees/ud/UniversalEnhancer.java
Is there any way we can build a server interface for this UniversalEnhancer within CoreNLP, such that the server may be able to take in a CoNLL-U or json representation, and return an enhanced serialization of the graph?
It's a little more complicated than that. In English, there's a more specialized version in trees.GrammaticalStructure.java:
public List…
This winds up calling some English-specific code in UniversalEnglishGrammaticalStructure.java:
@Override
protected void addEnhancements(List…
I believe there was an attempt at doing the same thing in Chinese, although I have no idea how good it is. I don't believe any other language has the specialized conversions.
Contacting Chris or Sebastian would get more information - I'll drop Sebastian a note and maybe ask Chris at my next meeting if we don't figure it out by then. At any rate, adding a way of doing this via the server is certainly doable.
Yes, that sounds like a good plan to me. The UD enhanced dependency page does suggest a handful of language-independent rules for conversion, so it makes sense to have a language-independent conversion module in the CoreNLP server going forward.
I investigated this some, and it sounds like UniversalEnhancer could indeed be used to add enhancements to any language. However, it needs a language-specific list of relativizing pronouns. For example, in English, the list looks like:
public static final String RELATIVIZING_WORD_REGEX = "(?i:that|what|which|who|whom|whose)";
Without that, the initial step would be impossible, and once that is done incorrectly, several of the later steps would be negatively affected as well. In other words, you would still get some sort of result, but it wouldn't be nearly as useful. Is it possible to provide such a list for other languages?
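As a sanity check, the English pattern quoted above is an ordinary regex with an inline case-insensitivity group, so it can be tried directly in Python. This snippet is just an illustration of the pattern's behavior, not part of stanza or CoreNLP:

```python
import re

# Same pattern as the Java constant quoted above; (?i:...) makes the
# alternation case-insensitive (supported in Python 3.6+).
RELATIVIZING_WORD_REGEX = r"(?i:that|what|which|who|whom|whose)"
pattern = re.compile(RELATIVIZING_WORD_REGEX)

assert pattern.fullmatch("Which") is not None   # case-insensitive match
assert pattern.fullmatch("whose") is not None
assert pattern.fullmatch("the") is None         # not a relativizer
```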
There's also the question of whether the English and Chinese specific versions are better than the generic one. I'm sure it would be for English, but not so sure about Chinese.
Some treebanks, such as the upcoming Chukchi treebank, also have enhanced dependencies in the annotation. It would be great to be able to train on those too.
Alright, this is a really old issue, but I figured it would be nice to add this as a feature before the next release. What I did was build a Python interface to the Java UniversalEnhancer code. However, there's a limitation: some languages, such as Chinese, don't have relative pronouns, so relative clauses can't be built with the mechanism used in this code.
https://en.wikipedia.org/wiki/Relative_clause#Chinese https://en.wikipedia.org/wiki/Relative_pronoun#Absence
Any suggestions on how to handle ref dependencies or add relative clauses there would be appreciated - this is not my strength.
The interface is here; I will work on the Python side of it as well:
https://github.com/stanfordnlp/CoreNLP/pull/1148/commits/c548e6c6a7c20fa2fc82d35fe399ccc887c78ec9
This is still a work in progress, with some more testing etc. necessary, but it should be usable now:
Java side (needs to be recompiled):
https://github.com/stanfordnlp/corenlp/tree/ud_enhancer https://github.com/stanfordnlp/CoreNLP/blob/ud_enhancer/src/edu/stanford/nlp/trees/ud/ProcessUniversalEnhancerRequest.java
Python interface:
https://github.com/stanfordnlp/stanza/tree/ud_enhancer_v2 https://github.com/stanfordnlp/stanza/blob/ud_enhancer_v2/stanza/server/ud_enhancer.py
if any of those branches stop existing in the future, it's because they've been merged into dev or possibly even main
Currently we do not support enhancing relative clauses in Chinese, FWIW. For English, it will use that/which/etc.; for Chinese, it will skip relative clauses; and for other languages, it will complain and ask you to provide a regex which matches relative clause pronouns.
The above links show a 404 Not Found error. Can you please give the exact URLs, or the changes that need to be made in order to get enhanced dependencies?
The CoreNLP changes are now included in the most recent release:
https://stanfordnlp.github.io/CoreNLP/
As of this comment, the stanza changes are in the dev branch:
https://github.com/stanfordnlp/stanza/tree/dev
I expect to release a new version of stanza (including these changes) in the next week or so.
Ok Thanks
Hi @AngledLuffa, when trying to use the enhancer:
import stanza.server.ud_enhancer as ud_enhancer
ud_enhancer.process_doc(doc, language="en")
The following errors are reported:
/usr/local/lib/python3.7/dist-packages/stanza/server/ud_enhancer.py in process_doc(doc, language, pronouns_pattern)
49 def process_doc(doc, language=None, pronouns_pattern=None):
50 request = build_enhancer_request(doc, language, pronouns_pattern)
---> 51 return send_request(request, Document, ENHANCER_JAVA, "$CLASSPATH")
52
53 class UniversalEnhancer(JavaProtobufContext):
/usr/local/lib/python3.7/dist-packages/stanza/server/java_protobuf_requests.py in send_request(request, response_type, java_main, classpath)
12 input=request.SerializeToString(),
13 stdout=subprocess.PIPE,
---> 14 check=True)
15 response = response_type()
16 response.ParseFromString(pipe.stdout)
/usr/lib/python3.7/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
486 kwargs['stderr'] = PIPE
487
--> 488 with Popen(*popenargs, **kwargs) as process:
489 try:
490 stdout, stderr = process.communicate(input, timeout=timeout)
/usr/lib/python3.7/subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
798 c2pread, c2pwrite,
799 errread, errwrite,
--> 800 restore_signals, start_new_session)
801 except:
802 # Cleanup if the child failed starting.
/usr/lib/python3.7/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
1480 errread, errwrite,
1481 errpipe_read, errpipe_write,
-> 1482 restore_signals, start_new_session, preexec_fn)
1483 self._child_created = True
1484 finally:
TypeError: expected str, bytes or os.PathLike object, not NoneType
Any hints on how to solve these? Thanks!
What version of CoreNLP do you have? We're missing the other end of the UD Enhancer in the most recent release of CoreNLP, but it's in the dev branch and you could install that instead. Alternatively, it is going to be in the next release of CoreNLP, which should be available within a week anyway.
I am using CoreNLP 4.2.2, which I thought had already incorporated the features. Will wait for the next release then. Thank you!
Getting the same error as reported by victoryhb above while using ud_enhancer. Do we need the CoreNLP server running locally to get enhanced dependencies using stanza?
Which version of CoreNLP are you using? You do not need the server running at all, but you do need a recent version of CoreNLP in your classpath.
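The traceback above ends with subprocess receiving a NoneType instead of a path, which typically means the classpath could not be resolved. One way to set $CLASSPATH from Python before using the enhancer is sketched below; the `~/corenlp-4.4.0` path is a made-up example, so substitute wherever you actually unzipped the CoreNLP release:

```python
import os

# Hypothetical location of the unzipped CoreNLP release; the trailing "*"
# lets Java pick up every jar in that directory.
corenlp_home = os.path.expanduser("~/corenlp-4.4.0")
os.environ["CLASSPATH"] = os.path.join(corenlp_home, "*")

print(os.environ["CLASSPATH"])
```

This must be done before the enhancer spawns its Java subprocess; exporting CLASSPATH in the shell before launching Python works just as well.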
I have CoreNLP 4.4.0 and was using it in Colab. process_doc gives the error while running the subprocess. However, I was able to use it the way it's specified in ud_enhancer.py:
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse')
with ud_enhancer.UniversalEnhancer(language="en") as enhancer:
    depparseFromStanza = nlp("This is a test")
    depparseEnhanced = enhancer.process(depparseFromStanza)
I don't really use Colab for anything, but hopefully you can figure out how to make it work!