stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Enhanced Dependencies Support #359

Open mahdiman opened 4 years ago

mahdiman commented 4 years ago
yuhaozhang commented 4 years ago

This would certainly be a useful feature to have. @AngledLuffa I am not familiar with the enhanced dependency implementation in CoreNLP - how difficult do you think this is?

AngledLuffa commented 4 years ago

I'm not very familiar with the dependency parser implementation in stanza. Does it allow multiple connections for a dependent? If not, we would need to write a dependency converter or reuse the CoreNLP converter. Reusing sounds like the better option.

qipeng commented 4 years ago

@AngledLuffa It could be adapted to allow multiple connections per dependent, but it doesn't support that out of the box.
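
For reference, this is what the enhanced representation has to allow: in CoNLL-U, enhanced dependencies go in the DEPS column as head:relation pairs, and a single token can carry more than one head. A constructed example for "Sue arrived and fell" (unused columns left as _):

1	Sue	_	_	_	_	2	nsubj	2:nsubj|4:nsubj	_
2	arrived	_	_	_	_	0	root	0:root	_
3	and	_	_	_	_	4	cc	4:cc	_
4	fell	_	_	_	_	2	conj	2:conj:and	_

Here "Sue" keeps its basic head (arrived) but gains a second head (fell) in the enhanced graph.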

AngledLuffa commented 4 years ago

It's a question of whether we'd rather use the Java server for accessing the converter or allow multiple connections per dependent. I think reimplementing it in Python would be the worst possible solution.

yuhaozhang commented 4 years ago

Why is that? Is it due to the complexity of the task itself? Complexity aside, I feel that a native Python implementation would be better integrated with the neural pipeline, since users would never need to leave the Stanza Python environment. Ideally we could have it as a processor that takes the depparse output and grows new annotations onto the document.

AngledLuffa commented 4 years ago

I was thinking in terms of having to repeat all of the logic involved in the java version

If the formalism changes or we come up with an improvement, we'd need to remember to redo it on both sides

AngledLuffa commented 4 years ago

A more generalizable way of doing it would be to write the conversion as a sequence of rules which could be applied in both Java & Python, I suppose.
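
Purely as a sketch of what one such shared rule could look like (the graph API here is hypothetical, and conjunct subject propagation is just one of the enhancements in the UD spec):

# Hypothetical rule: propagate the subject of the first conjunct to later
# conjuncts, so the subject token ends up with more than one head.
# graph.edges(), graph.edges_from(), and graph.add_edge() are assumed helpers,
# not a real stanza API.
def propagate_conj_subject(graph):
    for head, rel, dep in list(graph.edges()):
        if rel.startswith('conj'):
            for _, rel2, subj in graph.edges_from(head):
                if rel2 in ('nsubj', 'csubj'):
                    # e.g. "Sue arrived and fell": add fell -nsubj-> Sue
                    graph.add_edge(dep, rel2, subj)
    return graph

A Java runner and a Python runner interpreting the same rule table would keep the two sides from drifting apart.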

yuhaozhang commented 4 years ago

This is a great point. Maintenance could be an issue going forward. Does the CoreNLP converter support external dependency annotations? If so, in what format?

AngledLuffa commented 4 years ago

Unfortunately, no, which is part of why the easiest solution by far would be to leverage the existing converter

yuhaozhang commented 4 years ago

Looks like this is what we want? https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/trees/ud/UniversalEnhancer.java

Is there any way we can build a server interface for this UniversalEnhancer within CoreNLP, such that the server could take in a CoNLL-U or JSON representation and return an enhanced serialization of the graph?

AngledLuffa commented 4 years ago

It's a little more complicated than that. In English, there's a more specialized version in trees.GrammaticalStructure.java:

public List<TypedDependency> typedDependenciesEnhancedPlusPlus() {
    List<TypedDependency> tdl = typedDependencies(Extras.MAXIMAL);
    addEnhancements(tdl, UniversalEnglishGrammaticalStructure.ENHANCED_PLUS_PLUS_OPTIONS);
    return tdl;
}

This winds up calling some English-specific code in UniversalEnglishGrammaticalStructure.java:

@Override
protected void addEnhancements(List<TypedDependency> list, EnhancementOptions options)

I believe there was an attempt at doing the same thing in Chinese, although I have no idea how good it is. I don't believe any other language has the specialized conversions.

Contacting Chris or Sebastian would get us more information - I'll drop Sebastian a note and maybe ask Chris at my next meeting if we don't figure it out by then. At any rate, adding a way of doing this via the server is certainly doable.
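
For English, in fact, enhanced dependencies are already reachable from stanza through the CoreNLP client - a minimal sketch, assuming CoreNLP is installed and CORENLP_HOME points at it:

from stanza.server import CoreNLPClient

# For English, the depparse annotator already produces enhanced++ dependencies,
# exposed on the protobuf Sentence objects returned by annotate().
with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'depparse']) as client:
    ann = client.annotate('Sue arrived and fell asleep.')
    print(ann.sentence[0].enhancedPlusPlusDependencies)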

yuhaozhang commented 4 years ago

Yes, that sounds like a good plan to me. The UD enhanced dependencies page does suggest a handful of language-independent rules for conversion, so it makes sense to have a language-independent conversion module in the CoreNLP server going forward.

AngledLuffa commented 4 years ago

I investigated this some, and it sounds like UniversalEnhancer could indeed be used to add enhancements for any language. However, it needs a language-specific list of relativizing pronouns. For example, in English, the list looks like

public static final String RELATIVIZING_WORD_REGEX = "(?i:that|what|which|who|whom|whose)";

Without that, the initial step would be impossible, and once that step is done incorrectly, several of the later steps would be negatively affected as well. In other words, you would still get some sort of result, but it wouldn't be nearly as useful. Is it possible to provide such a list for other languages?

There's also the question of whether the English and Chinese specific versions are better than the generic one. I'm sure it would be for English, but not so sure about Chinese.

ftyers commented 3 years ago

Some treebanks, like the upcoming Chukchi treebank, also have enhanced dependencies in the annotation. It would be great to be able to train on those too.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

AngledLuffa commented 3 years ago

Alright, this is a really old issue, but I figured it would be nice to add this as a feature before the next release. What I did was work on a Python interface to the Java UniversalEnhancer code. However, there's a limitation: some languages such as Chinese don't have relative pronouns, so their relative clauses can't be built with the mechanism used in this code.

https://en.wikipedia.org/wiki/Relative_clause#Chinese https://en.wikipedia.org/wiki/Relative_pronoun#Absence

Any suggestions on how to handle ref dependencies or add relative clauses there would be appreciated - this is not my strength.

AngledLuffa commented 3 years ago

The interface is here; I'll work on the Python side of it as well:

https://github.com/stanfordnlp/CoreNLP/pull/1148/commits/c548e6c6a7c20fa2fc82d35fe399ccc887c78ec9

AngledLuffa commented 3 years ago

This is still a work in progress, with some more testing etc necessary, but it should be usable now:

Java side (needs to be recompiled):

https://github.com/stanfordnlp/corenlp/tree/ud_enhancer https://github.com/stanfordnlp/CoreNLP/blob/ud_enhancer/src/edu/stanford/nlp/trees/ud/ProcessUniversalEnhancerRequest.java

Python interface:

https://github.com/stanfordnlp/stanza/tree/ud_enhancer_v2 https://github.com/stanfordnlp/stanza/blob/ud_enhancer_v2/stanza/server/ud_enhancer.py

If any of those branches stop existing in the future, it's because they've been merged into dev or possibly even main.

AngledLuffa commented 3 years ago

Currently we do not support enhancing relative clauses in Chinese, fwiw. For English, it will use that/which/etc; for Chinese, it will skip relative clauses; and for other languages, it will complain and ask you to provide a regex which matches relative clause pronouns.
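
For an otherwise unsupported language, the call would then look something like this - a sketch only, using the process_doc signature from ud_enhancer.py; the German relativizer regex is illustrative, not a vetted inventory:

import stanza
import stanza.server.ud_enhancer as ud_enhancer

# Assumes a recent CoreNLP on the CLASSPATH. The pronouns_pattern regex is a
# guess at German relativizers, purely for illustration.
nlp = stanza.Pipeline(lang='de', processors='tokenize,pos,lemma,depparse')
doc = nlp('Der Mann, der dort steht, ist mein Bruder.')
enhanced = ud_enhancer.process_doc(doc, pronouns_pattern='(?i:der|die|das|welcher|welche|welches)')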

CJPJ007 commented 3 years ago

The above links show a 404 Not Found error. Can you please give the exact URLs, or the changes that need to be made in order to get enhanced dependencies?

AngledLuffa commented 3 years ago

The CoreNLP changes are now included in the most recent release:

https://stanfordnlp.github.io/CoreNLP/

As of this comment, the stanza changes are in the dev branch:

https://github.com/stanfordnlp/stanza/tree/dev

I expect to release a new version of stanza (including these changes) in the next week or so.

CJPJ007 commented 3 years ago

Ok, thanks!

victoryhb commented 3 years ago

Hi @AngledLuffa, when trying to use the enhancer:

import stanza.server.ud_enhancer as ud_enhancer
ud_enhancer.process_doc(doc, language="en")

The following error is reported:

/usr/local/lib/python3.7/dist-packages/stanza/server/ud_enhancer.py in process_doc(doc, language, pronouns_pattern)
     49 def process_doc(doc, language=None, pronouns_pattern=None):
     50     request = build_enhancer_request(doc, language, pronouns_pattern)
---> 51     return send_request(request, Document, ENHANCER_JAVA, "$CLASSPATH")
     52 
     53 class UniversalEnhancer(JavaProtobufContext):

/usr/local/lib/python3.7/dist-packages/stanza/server/java_protobuf_requests.py in send_request(request, response_type, java_main, classpath)
     12                           input=request.SerializeToString(),
     13                           stdout=subprocess.PIPE,
---> 14                           check=True)
     15     response = response_type()
     16     response.ParseFromString(pipe.stdout)

/usr/lib/python3.7/subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    486         kwargs['stderr'] = PIPE
    487 
--> 488     with Popen(*popenargs, **kwargs) as process:
    489         try:
    490             stdout, stderr = process.communicate(input, timeout=timeout)

/usr/lib/python3.7/subprocess.py in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, encoding, errors, text)
    798                                 c2pread, c2pwrite,
    799                                 errread, errwrite,
--> 800                                 restore_signals, start_new_session)
    801         except:
    802             # Cleanup if the child failed starting.

/usr/lib/python3.7/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
   1480                             errread, errwrite,
   1481                             errpipe_read, errpipe_write,
-> 1482                             restore_signals, start_new_session, preexec_fn)
   1483                     self._child_created = True
   1484                 finally:

TypeError: expected str, bytes or os.PathLike object, not NoneType

Any hints on how to solve these? Thanks!

AngledLuffa commented 3 years ago

What version of CoreNLP do you have? We're missing the other end of the UD Enhancer in the most recent release of CoreNLP, but it's in the dev branch and you could install that instead. Alternatively, it is going to be in the next release of CoreNLP, which should be available within a week anyway.

victoryhb commented 3 years ago

I am using CoreNLP 4.2.2, which I thought had already incorporated the features. Will wait for the next release then. Thank you!

swatiagarwal-s commented 2 years ago

Getting the same error as reported by victoryhb above while using ud_enhancer. Do we need the CoreNLP server running locally to get enhanced dependencies using stanza?

AngledLuffa commented 2 years ago

Which version of CoreNLP are you using? You do not need the server running at all, but you do need a recent version of CoreNLP in your classpath.
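
Something along these lines before creating the enhancer should work - the install path below is a placeholder for wherever you unzipped the CoreNLP release:

import os

# Placeholder path: point CLASSPATH at the jars of a recent CoreNLP release
# so the enhancer's Java subprocess can find its classes.
os.environ['CLASSPATH'] = '/path/to/stanford-corenlp-4.4.0/*'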

swatiagarwal-s commented 2 years ago

I have CoreNLP 4.4.0 and was using it in Colab. process_doc gives the error while running the subprocess. However, I was able to use it the way it's specified in ud_enhancer.py:

import stanza
import stanza.server.ud_enhancer as ud_enhancer

# Run the regular pipeline first, then add enhanced dependencies in a second pass.
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse')
with ud_enhancer.UniversalEnhancer(language="en") as enhancer:
    depparseFromStanza = nlp("This is a test")
    depparseEnhanced = enhancer.process(depparseFromStanza)

AngledLuffa commented 2 years ago

I don't really use Colab for anything, but hopefully you can figure out how to make it work!