ValueError: Invalid control character at: line 1120 column 21 (char 28474)

aliabbasjp commented 7 years ago

I'm getting the following error when parsing a corpus:

17:42:39: Parsing finished. Moving parsed files into place ...
Traceback (most recent call last):
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/env.py", line 2168, in interpreter
    out = run_command(tokens)  
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/env.py", line 1113, in run_command
    out = command(tokens[1:])
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/env.py", line 1437, in parse_corpus
    parsed = to_parse.parse(**kwargs)  
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/corpus.py", line 930, in parse
    **kwargs
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/make.py", line 356, in make_corpus
    coref=coref, metadata=metadata)
  File "/home/d/anaconda2/lib/python2.7/site-packages/corpkit/conll.py", line 1113, in convert_json_to_conll
    data = json.load(fo)
  File "/home/d/anaconda2/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/home/d/anaconda2/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/home/d/anaconda2/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/d/anaconda2/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid control character at: line 1120 column 21 (char 28474)

interrogator commented 7 years ago

Thanks for these reports. This is a weird one: the JSON output of the CoreNLP parser contains a control character that Python's json module refuses to parse. So the problem is not really on corpkit's side, but CoreNLP's.
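
To illustrate the kind of failure in the traceback, here is a minimal sketch (not corpkit code; the sample string and the `try_load` helper are made up for the example). It shows that the standard `json` module rejects a raw, unescaped control character inside a string, and that passing `strict=False` makes the decoder tolerate it:

```python
import json

# Hypothetical sample: a JSON document with a raw (unescaped) tab
# character inside a string value, which strict JSON does not allow.
raw = '{"word": "foo\tbar"}'


def try_load(text, strict=True):
    """Return (parsed, error_message); exactly one of the two is None."""
    try:
        return json.loads(text, strict=strict), None
    except ValueError as err:
        # json raises ValueError (JSONDecodeError on Python 3) here.
        return None, str(err)


parsed, error = try_load(raw)
lenient, lenient_error = try_load(raw, strict=False)

print(error)    # message starts with "Invalid control character at: ..."
print(lenient)  # the document parses once strict checking is disabled
```

Whether `strict=False` is an acceptable workaround here is an open question, since it only hides the malformed output rather than fixing it.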

Similar bugs have been reported to CoreNLP: https://github.com/stanfordnlp/CoreNLP/issues/241

I'm guessing that it relates to the encoding in your text files. Would you be able to zip and upload the files in the unparsed/parsed versions of the corpus? This would help me diagnose the problem and make a fix.
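
If you want to check this yourself first, something like the following rough sketch could show which character sits at the offset reported in the error. The `describe_char_at` helper is hypothetical, and I'm assuming the reported offset (`char 28474`) is an index into the string that was handed to the JSON decoder:

```python
import unicodedata


def describe_char_at(text, offset, context=20):
    """Describe the character at `offset` and show some surrounding text."""
    ch = text[offset]
    start = max(0, offset - context)
    return {
        "codepoint": "U+%04X" % ord(ch),
        # Unicode category "Cc" means "control character".
        "is_control": unicodedata.category(ch) == "Cc",
        # repr() makes invisible characters visible as escapes.
        "context": repr(text[start:offset + context]),
    }


# Made-up sample with a vertical tab (U+000B) at index 2.
sample = u"ab\x0bcd"
info = describe_char_at(sample, 2)
print(info["codepoint"], info["is_control"])
```

For the real file, you would read the JSON output into a string and pass the offset from the error message.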

interrogator commented 7 years ago

Also, I'd recommend encoding your text files as UTF-8, which should fix this problem in your case. Alternatively, as per the instructions on the issue linked above, update your CoreNLP installation to the GitHub version. If corpkit installed CoreNLP for you, it should be in your ~/corenlp directory.
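
A rough sketch of the re-encoding step, under some assumptions: I don't know the current encoding of your files, so `latin-1` below is only a placeholder, and the function names are invented for this example. Stripping stray control characters goes beyond what I suggested above, but it may be worth doing in the same pass:

```python
import io
import re

# Control characters other than tab, newline and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_text(text):
    """Remove stray control characters, keeping normal whitespace."""
    return CONTROL_CHARS.sub("", text)


def reencode_file(src_path, dst_path, src_encoding="latin-1"):
    """Read a file in `src_encoding` and write a cleaned UTF-8 copy.

    `src_encoding` is an assumption; set it to whatever encoding your
    text files actually use.
    """
    with io.open(src_path, "r", encoding=src_encoding) as fin:
        text = fin.read()
    with io.open(dst_path, "w", encoding="utf-8") as fout:
        fout.write(clean_text(text))
```

Write the cleaned copies to a new directory rather than overwriting the originals, then parse that directory instead.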
