stanfordnlp / python-stanford-corenlp

Python interface to CoreNLP using a bidirectional server-client interface.
MIT License

Percent (%) and following 2 characters are removed, possibly due to URL escape issue #42

Open na2hiro opened 4 years ago

na2hiro commented 4 years ago

A sequence of the form %XX is removed from the text when XX is a pair of hexadecimal digits, which looks like a URL escape issue (ref. https://github.com/stanfordnlp/CoreNLP/issues/784). Passing a URL-encoded string instead returns the expected result.

>>> import urllib.parse
>>> [r.originalText for r in client.annotate("100%absolutely sure".lower()).sentencelessToken]
['100', 'solutely', 'sure']
>>> [r.originalText for r in client.annotate(urllib.parse.quote("100%absolutely sure".lower())).sentencelessToken]
['100', '%', 'absolutely', 'sure']
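
The mechanism can be reproduced without a running server. The following is a minimal sketch that assumes the text is URL-decoded exactly once somewhere between the client and the tokenizer; that assumption is only inferred from the behavior above, and Python's `unquote` stands in for whatever decoding the server actually performs. `escape_for_annotate` is a hypothetical helper name, not part of this library.

```python
from urllib.parse import quote, unquote


def escape_for_annotate(text: str) -> str:
    """Hypothetical helper: percent-encode text so that one round of
    URL decoding (assumed to happen on the way to the server) restores it."""
    return quote(text)


raw = "100%absolutely sure"

# Decoding the raw text treats "%ab" as an escape sequence, so the
# characters "%ab" no longer appear in the result.
decoded_once = unquote(raw)

# Encoding first turns "%" into "%25", so a single decode restores the input.
escaped = escape_for_annotate(raw)
restored = unquote(escaped)

# "%su" is not a valid escape ("s" and "u" are not hex digits), so it survives.
untouched = unquote("100%sure")

print(repr(escaped))
print(restored == raw)
print("absolutely" in decoded_once)
print(untouched)
```

If the decoding assumption holds, wrapping the input in `urllib.parse.quote` before calling `client.annotate` is the same workaround as in the second REPL line above. Text that is never decoded would come back with literal `%25` sequences, so it is worth checking the tokens on a small sample first.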

After I found this bug, I noticed that this library is deprecated. Feel free to close this issue; I just wanted to leave a pointer for anyone who runs into the same problem. Thanks in advance!