Avoid stripping the input text

stanfordnlp / python-stanford-corenlp

Python interface to CoreNLP using a bidirectional server-client interface.

MIT License

Avoid stripping the input text #38

Status: Closed (lcswillems closed this 5 years ago)

lcswillems commented 5 years ago

Hi,

If I do:

import corenlp

client = corenlp.CoreNLPClient(annotators=('tokenize', 'parse'))
client.annotate(" hi ")

I get:

text: "hi"
sentence {
  token {
    word: "hi"
    pos: "UH"
    value: "hi"
    before: ""
    after: ""
    originalText: "hi"
    beginChar: 0
    endChar: 2
    beginIndex: 0
    endIndex: 1
    tokenBeginIndex: 0
    tokenEndIndex: 1
    hasXmlContext: false
  }
...

The text is "hi" instead of " hi ". How can I make CoreNLP stop stripping the input text?

arunchaganty commented 5 years ago

Hmm... what is the motivation for not stripping the whitespace?

Unfortunately, the root cause of this problem is in the CoreNLPServer. Text sent to the server is trimmed out of an abundance of caution, in case any whitespace was added during the URL encoding/decoding process. It should be OK to simply not do this, and the easiest way to do so would be through a local checkout of CoreNLP.
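
If patching the server is not an option, a client-side workaround might be to shift the returned character offsets by the amount of leading whitespace that was dropped. This is only a minimal sketch: it assumes the server's trim removes the same leading characters as Python's str.lstrip() (the Java side may treat some Unicode whitespace differently), and the helper names are hypothetical, not part of the corenlp package.

```python
def leading_trim_length(text: str) -> int:
    """Count the leading whitespace characters that a trim would drop.

    Assumption: the server-side trim matches str.lstrip() on this input.
    """
    return len(text) - len(text.lstrip())


def to_original_span(text: str, begin_char: int, end_char: int) -> tuple:
    """Map a (beginChar, endChar) pair reported for the trimmed text
    back to offsets in the original, untrimmed text."""
    shift = leading_trim_length(text)
    return begin_char + shift, end_char + shift


# The token from the report above has beginChar=0 and endChar=2.
original = " hi "
begin, end = to_original_span(original, 0, 2)
print(repr(original[begin:end]))
```

Trailing whitespace needs no correction for offsets, since dropping it does not move any earlier character.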

lcswillems commented 5 years ago

Because there is no reason for CoreNLP to strip it. If I want the text stripped, I can do that myself. I may want to give it some text that is not stripped.

manning commented 5 years ago

As @arunchaganty notes, the trimming was happening in the Java CoreNLPServer. But it seems to me that there was no good reason for us to do this, and it is wrong in principle, as @lcswillems notes. So I've removed it. 🙂 This should be fixed whenever we next make a CoreNLP release (probably at the end of the (northern) summer), or sooner if you cherry-pick CoreNLP commit 082ed17b04bdb7ed1ae613916d713942f7c24dfb.

lcswillems commented 5 years ago

That's great!! Thanks a lot! :)