sina-al / pynlp

A pythonic wrapper for Stanford CoreNLP.
MIT License

UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' #19

Closed: fernio closed this issue 6 years ago

fernio commented 6 years ago

I'm trying to use pynlp to process a batch of text files, but one of them, crashpynlp.txt, makes it crash. Using the following script:

from pynlp import StanfordCoreNLP

with open("crashpynlp.txt", 'r') as file:
    text = file.read()
    nlp = StanfordCoreNLP(annotators="tokenize, ssplit, pos, lemma, ner")
    doc = nlp(text)

I'm getting the following traceback:

  File "testPynlp.py", line 6, in <module>
    doc = nlp(text)
  File "/home/fernio/.local/lib/python3.6/site-packages/pynlp/client.py", line 132, in __call__
    return self.annotate_one(texts)
  File "/home/fernio/.local/lib/python3.6/site-packages/pynlp/client.py", line 138, in annotate_one
    return Document(self._annotate(text))
  File "/home/fernio/.local/lib/python3.6/site-packages/pynlp/client.py", line 135, in _annotate
    return self._client.post(url=self._address, data=text, params=(('properties', str(self._properties)),))
  File "/home/fernio/.local/lib/python3.6/site-packages/requests/sessions.py", line 559, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/home/fernio/.local/lib/python3.6/site-packages/pynlp/client.py", line 81, in request
    response = super(CoreNLPClient, self).request(*args, **kwargs)
  File "/home/fernio/.local/lib/python3.6/site-packages/requests/sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/fernio/.local/lib/python3.6/site-packages/requests/sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "/home/fernio/.local/lib/python3.6/site-packages/requests/adapters.py", line 445, in send
    timeout=timeout
  File "/home/fernio/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/home/fernio/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1284, in _send_request
    body = _encode(body, 'body')
  File "/usr/lib/python3.6/http/client.py", line 161, in _encode
    (name.title(), data[err.start:err.end], name)) from None
UnicodeEncodeError: 'latin-1' codec can't encode character '\u201c' in position 39: Body ('“') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

sina-al commented 6 years ago

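Try encoding the text as UTF-8 before passing it to the client. When the request body is a str, http.client encodes it as Latin-1, which cannot represent characters such as '\u201c':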

from pynlp import StanfordCoreNLP

with open("crashpynlp.txt", 'r') as file:
    # Encode to UTF-8 bytes so the request body is not encoded as Latin-1.
    text = file.read().encode('utf-8')
    nlp = StanfordCoreNLP(annotators="tokenize, ssplit, pos, lemma, ner")
    doc = nlp(text)
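
For anyone who wants to check this without a running CoreNLP server, here is a minimal sketch of the encoding step that fails in the traceback. The sample string and the `to_request_body` helper are made up for illustration and are not part of pynlp. It only shows that Latin-1 rejects a curly quote while UTF-8 accepts it.

```python
# Illustrative only: mimics the encoding step from the traceback, no server needed.
text = "\u201cquoted\u201d text"  # hypothetical sample containing curly quotes


def to_request_body(s: str) -> bytes:
    """Hypothetical helper: encode text as UTF-8 bytes before sending it."""
    return s.encode("utf-8")


try:
    # http.client attempts this when the request body is a str.
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Bytes are sent as-is, so no Latin-1 encoding is attempted.
body = to_request_body(text)
```

Passing bytes means the HTTP layer never has to guess an encoding. This assumes the server expects UTF-8, which matches the fix above.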

fernio commented 6 years ago

That did the trick, thanks.