stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Invalid JSON output format #61

Closed: ghost closed this issue 9 years ago

ghost commented 9 years ago

Hello,

When I run POS tagging with stanford-corenlp-3.5.1, StanfordCoreNLP's jsonPrint method produces output that includes the following:

{ "index": "5", "word": "\'s", "lemma": "\'s", "characterOffsetBegin": "17", "characterOffsetEnd": "19", "pos": "POS" }

Sample sentence: "I was the teacher's student."

It looks like the "word" and "lemma" values are not valid JSON, so JSON validation fails. According to http://json.org/, single quote characters are not escaped inside strings, and \' is not a valid escape sequence. You can check this at http://jsonlint.com/
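To make the rule concrete, here is a minimal sketch of JSON string escaping in Java. The class `JsonStringEscaper` and its `quote` method are hypothetical names used only for illustration; this is not the code in JSONOutputter.java and not necessarily how the fix was implemented. It escapes only what the JSON grammar requires (the double quote, the backslash, and control characters) and passes the single quote through unchanged.

```java
// Hypothetical helper for illustration only; not CoreNLP's JSONOutputter code.
final class JsonStringEscaper {

    private JsonStringEscaper() {}

    /** Returns the input wrapped in double quotes, with JSON escapes applied. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 2);
        sb.append('"');
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;  // double quote must be escaped
                case '\\': sb.append("\\\\"); break;  // backslash must be escaped
                case '\b': sb.append("\\b"); break;
                case '\f': sb.append("\\f"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \ u XXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        // Everything else, including the single quote, is emitted as-is.
                        sb.append(c);
                    }
            }
        }
        sb.append('"');
        return sb.toString();
    }
}
```

With this sketch, `JsonStringEscaper.quote("'s")` yields the string `"'s"` with no backslash, which is the form a JSON validator accepts for the "word" and "lemma" values above.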

I glanced at the code, and this part may be where the problem is: https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/JSONOutputter.java#L178

I hope I have not misunderstood something. I would be happy to commit a fix for it. Regards,

gangeli commented 9 years ago

Yes, this is a bug -- thanks for pointing it out! We'll commit the fix within the next day or so. Is crediting it to the username + address on your github ok?

(I think I messed up the commit message for linking the commit to the issue, but it's fixed now in: https://github.com/stanfordnlp/CoreNLP/commit/bae0138ee5f2e60653bcadb13ae9c5ba0762c658)

ghost commented 9 years ago

Thank you for the answer. Yes, I'm happy to be credited!