stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Discrepancy tokensregex webserver and CoreNLP in Python? #1075

Open rickbeeloo opened 4 years ago

rickbeeloo commented 4 years ago

Let's take the following sentence:

organic wastes under variable temperature conditions

and the pattern:

[{tag:/JJ/}]*[{tag:/NN.*/}]+

When we pass this to http://corenlp.run/ we get: [image]

Then when we do this in Python:

from stanza.server import CoreNLPClient

# Start a CoreNLP server and run the TokensRegex pattern against the text
with CoreNLPClient(memory='16G', threads=1, annotators=['tokenize','ssplit','pos','lemma','ner','depparse']) as client:
    text = 'organic wastes under variable temperature conditions'
    print(client.tokensregex(text, '[{tag:/JJ/}]*[{tag:/NN.*/}]+'))

It will print:

{'sentences': [{'0': {'text': 'organic wastes', 'begin': 0, 'end': 2}, '1': {'text': 'variable temperature', 'begin': 3, 'end': 5}, '2': {'text': 'conditions', 'begin': 5, 'end': 6}, 'length': 3}]}

Note that the webserver finds "variable temperature conditions", whereas in Python we only find "variable temperature" and "conditions" as separate matches. I need the same output as the webserver.
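The Python output is what one would expect if the trailing + never reached the matcher. The sketch below is not CoreNLP code: it emulates the token pattern with an ordinary character regex, and the POS tags are assumed for illustration rather than taken from the server output.

```python
import re

# Assumed POS tags for the example sentence (not taken from the thread).
tokens = ["organic", "wastes", "under", "variable", "temperature", "conditions"]
tags = ["JJ", "NNS", "IN", "JJ", "NN", "NNS"]


def encode(tag):
    """Map a POS tag to one character so token patterns become plain regexes."""
    if tag == "JJ":
        return "J"
    if tag.startswith("NN"):
        return "N"
    return "O"


def find_spans(char_pattern):
    """Return (text, begin, end) per match, using token-index offsets."""
    encoded = "".join(encode(t) for t in tags)
    return [
        (" ".join(tokens[m.start():m.end()]), m.start(), m.end())
        for m in re.finditer(char_pattern, encoded)
    ]


with_plus = find_spans("J*N+")    # stands in for [{tag:/JJ/}]*[{tag:/NN.*/}]+
without_plus = find_spans("J*N")  # the same pattern with the trailing + lost

print(with_plus)
print(without_plus)
```

With the +, the second match covers "variable temperature conditions" as on the webserver; without it, the spans are the same three that the Python client reported.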

rickbeeloo commented 4 years ago

Upon inspecting the request header for the webserver, I noticed it adds a \ before the last +, so the pattern should be [{tag:/JJ/}]*[{tag:/NN.*/}]\+
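As a stopgap on affected versions, the escaping could be applied in Python before calling the client. This is a minimal sketch: escape_plus is a hypothetical helper, not part of stanza or CoreNLP, and escaping every unescaped + (rather than only the last one) is an assumption.

```python
import re


def escape_plus(pattern: str) -> str:
    """Backslash-escape each '+' that is not already preceded by a backslash.

    Hypothetical workaround helper; it does not handle an escaped
    backslash followed by '+'.
    """
    return re.sub(r"(?<!\\)\+", lambda m: "\\+", pattern)


pattern = "[{tag:/JJ/}]*[{tag:/NN.*/}]+"
escaped = escape_plus(pattern)
print(escaped)
```

The escaped string would then be passed as the second argument to client.tokensregex in place of the original pattern. Applying the helper twice leaves the pattern unchanged, so it is safe to call on an already escaped pattern.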

AngledLuffa commented 4 years ago

I'm not sure this deserves to be closed. Didn't you have the expectation that the API does the escaping for you? Perhaps we should fix that in the stanza client.

rickbeeloo commented 4 years ago

Yes, I did expect to be able to copy a regex from corenlp.run and obtain the same results (and thus also the same parsing).

rickbeeloo commented 4 years ago

I also noticed that entering the escaped regex, [{tag:/JJ/}]*[{tag:/NN.*/}]\+, on the webserver throws java.lang.RuntimeException: Error when parsing [{tag:/JJ/}]*[{tag:/NN.*/}]\+

This makes it even harder to test a regex: an escaped one seems incorrect when tested on the webserver but is correct in code, while a non-escaped one seems correct on the webserver but is not in code.

AngledLuffa commented 4 years ago

I thought at first this was on the stanza side, but then I discovered it was an issue with the Java code. The result is that although I fixed it, the fix didn't make it into the 4.1.0 version currently being built. It will be available in the next release or on GitHub, though.

rickbeeloo commented 4 years ago

Aaah awesome!

AngledLuffa commented 4 years ago

https://github.com/stanfordnlp/CoreNLP/commit/5e54ae4862d38de1e36020380b9dab48ab73eebc