sina-al / pynlp

A pythonic wrapper for Stanford CoreNLP.
MIT License

DecodeError: Tag had invalid wire type #6

Closed angoodkind closed 6 years ago

angoodkind commented 6 years ago

When running the analysis on a long list of strings, I always get this error after successfully processing a number of strings:

google.protobuf.message.DecodeError: Tag had invalid wire type.

I'm crawling random webpages, so it doesn't seem to matter what the actual contents of the string are. I'm using BeautifulSoup to extract just the text, and it's coerced into a string to ensure it's unicode.

From what I've read about this error, it seems it occurs when trying to write over an existing file. I think it would be ideal if I could reset the CoreNLP server after each iteration.

My current workflow is:

## start corenlp server from command line
$ python3 -m pynlp

In Python:

from pynlp import StanfordCoreNLP
annotators = 'tokenize, ssplit, pos, lemma, ner, entitymentions, coref, sentiment'
nlp = StanfordCoreNLP(annotators=annotators)
document = nlp(str(line['text'])) ## line['text'] is a line of unicode text

The traceback is:


Traceback (most recent call last):
  File "/Users/adamg/Dropbox/Northwestern/Classes/Text_Analytics/homework/ta-hw4/extract_debates.py", line 188, in <module>
    debate_sentiment_dct = analyze_utterances(analysis.get_lines())
  File "/Users/adamg/Dropbox/Northwestern/Classes/Text_Analytics/homework/ta-hw4/sentiment.py", line 14, in analyze_utterances
    document = nlp(str(line['text']))
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/pynlp/client.py", line 65, in __call__
    return self.annotate(text)
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/pynlp/client.py", line 72, in annotate
    return Document(_annotate(text, self._annotators, self._options, self._port))
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/pynlp/client.py", line 34, in _annotate
    return from_bytes(_annotate_binary(text, annotators, options, port))
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/pynlp/client.py", line 39, in from_bytes
    core.parseFromDelimitedString(doc, protobuf)
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/corenlp_protobuf/__init__.py", line 18, in parseFromDelimitedString
    obj.ParseFromString(buf[offset+pos:offset+pos+size])
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/google/protobuf/message.py", line 185, in ParseFromString
    self.MergeFromString(serialized)
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/google/protobuf/internal/python_message.py", line 1069, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/google/protobuf/internal/python_message.py", line 1095, in InternalParse
    new_pos = local_SkipField(buffer, new_pos, end, tag_bytes)
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/google/protobuf/internal/decoder.py", line 850, in SkipField
    return WIRETYPE_TO_SKIPPER[wire_type](buffer, pos, end)
  File "/Users/adamg/miniconda2/envs/text_analytics3/lib/python3.6/site-packages/google/protobuf/internal/decoder.py", line 820, in _RaiseInvalidWireType
    raise _DecodeError('Tag had invalid wire type.')
google.protobuf.message.DecodeError: Tag had invalid wire type.

On the command line, the CoreNLP server raises the error:


java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at edu.stanford.nlp.pipeline.StanfordCoreNLPServer$CoreNLPHandler.handle(StanfordCoreNLPServer.java:662)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
    at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:82)
    at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:675)
    at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:79)
    at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:647)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Is there an obvious cause for this error? Alternatively, is there a way to restart the CoreNLP server after each loop within python?

sina-al commented 6 years ago

Hi. If the error is being thrown inconsistently, then I suspect it may be due to a timeout from the server. As soon as I get some time, I will look into handling the HTTP requests and responses to and from the CoreNLP server a little more robustly (feel free to contribute). In the meantime, perhaps try specifying a timeout proportional to the size of the text you are annotating, e.g. python3 -m pynlp --timeout 60000

Please let me know the outcome. Sina
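
One way to pick a value for --timeout is to scale it with the longest text in the batch. The sketch below is only an illustration: suggest_timeout_ms is a hypothetical helper (not part of pynlp), the ms_per_char rate is an arbitrary placeholder that would need tuning for the annotators in use, and it assumes the --timeout value is in milliseconds, as the 60000 example suggests.

```python
def suggest_timeout_ms(texts, ms_per_char=50, floor_ms=60000):
    """Return a timeout (in ms) scaled to the longest text in `texts`.

    Hypothetical helper: `ms_per_char` is a placeholder rate, not a
    measured figure. `floor_ms` mirrors the 60000 used in the example above.
    """
    # Length of the longest input; 0 if the collection is empty.
    longest = max((len(t) for t in texts), default=0)
    # Never go below the floor, even for short inputs.
    return max(floor_ms, longest * ms_per_char)


if __name__ == "__main__":
    pages = ["short page", "x" * 10000]
    # Print the command to run, using the flag quoted in this thread.
    print("python3 -m pynlp --timeout", suggest_timeout_ms(pages))
```

The server would then be started with the printed command before running the annotation loop.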

r00t1ng commented 6 years ago

Hi. I was having the same issue, so I split the long text into a list of shorter pieces and then ran the nlp() function on each item of the list. No problems after that. Hope this helps.
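
A minimal sketch of the splitting approach described above. split_text and max_chars are hypothetical names, the sentence boundary detection is a naive regex, and the 1000-character default is an arbitrary starting point rather than a recommended value.

```python
import re


def split_text(text, max_chars=1000):
    """Split `text` into chunks of at most `max_chars` characters,
    breaking at sentence-like boundaries where possible."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-wrapped.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = sentence if not current else current + ' ' + sentence
        if len(candidate) <= max_chars:
            current = candidate      # sentence still fits in this chunk
        else:
            chunks.append(current)   # close the chunk, start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# Usage with the client from the original post (assumes `nlp` and `line` exist):
# documents = [nlp(chunk) for chunk in split_text(str(line['text']))]
```

Each chunk is annotated as a separate document, so annotations that span chunk boundaries (coreference chains, for example) will not be linked across chunks.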