stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Invalid output when using ProtobufAnnotationSerializer #798

Open jrmarkle opened 5 years ago

jrmarkle commented 5 years ago

I'm running CoreNLP server and I sent it this request:

wget --post-data 'the quick brown fox jumped over the lazy dog' http://localhost:9000/?properties=%7B%22annotators%22%3A%22tokenize%2Cssplit%22%2C%22outputFormat%22%3A%22serialized%22%2C%22serializer%22%3A%22edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%22%7D -O /tmp/output.bin

The output is 455 bytes and looks like it might be correct at first glance:

$ xxd /tmp/output.bin 
00000000: c503 0a2c 7468 6520 7175 6963 6b20 6272  ...,the quick br
00000010: 6f77 6e20 666f 7820 6a75 6d70 6564 206f  own fox jumped o
00000020: 7665 7220 7468 6520 6c61 7a79 2064 6f67  ver the lazy dog
00000030: 128b 030a 240a 0374 6865 1a03 7468 652a  ....$..the..the*
00000040: 0032 0120 3a03 7468 6558 0060 0388 0100  .2. :.theX.`....
00000050: 9001 01a8 0100 b002 000a 2b0a 0571 7569  ..........+..qui
00000060: 636b 1a05 7175 6963 6b2a 0120 3201 203a  ck..quick*. 2. :
00000070: 0571 7569 636b 5804 6009 8801 0190 0102  .quickX.`.......
00000080: a801 00b0 0200 0a2b 0a05 6272 6f77 6e1a  .......+..brown.
00000090: 0562 726f 776e 2a01 2032 0120 3a05 6272  .brown*. 2. :.br
000000a0: 6f77 6e58 0a60 0f88 0102 9001 03a8 0100  ownX.`..........
000000b0: b002 000a 250a 0366 6f78 1a03 666f 782a  ....%..fox..fox*
000000c0: 0120 3201 203a 0366 6f78 5810 6013 8801  . 2. :.foxX.`...
000000d0: 0390 0104 a801 00b0 0200 0a2e 0a06 6a75  ..............ju
000000e0: 6d70 6564 1a06 6a75 6d70 6564 2a01 2032  mped..jumped*. 2
000000f0: 0120 3a06 6a75 6d70 6564 5814 601a 8801  . :.jumpedX.`...
00000100: 0490 0105 a801 00b0 0200 0a28 0a04 6f76  ...........(..ov
00000110: 6572 1a04 6f76 6572 2a01 2032 0120 3a04  er..over*. 2. :.
00000120: 6f76 6572 581b 601f 8801 0590 0106 a801  overX.`.........
00000130: 00b0 0200 0a25 0a03 7468 651a 0374 6865  .....%..the..the
00000140: 2a01 2032 0120 3a03 7468 6558 2060 2388  *. 2. :.theX `#.
00000150: 0106 9001 07a8 0100 b002 000a 280a 046c  ............(..l
00000160: 617a 791a 046c 617a 792a 0120 3201 203a  azy..lazy*. 2. :
00000170: 046c 617a 7958 2460 2888 0107 9001 08a8  .lazyX$`(.......
00000180: 0100 b002 000a 240a 0364 6f67 1a03 646f  ......$..dog..do
00000190: 672a 0120 3200 3a03 646f 6758 2960 2c88  g*. 2.:.dogX)`,.
000001a0: 0108 9001 09a8 0100 b002 0010 0018 0920  ............... 
000001b0: 0028 0030 2c98 0300 b003 0088 0400 5800  .(.0,.........X.
000001c0: 6800 7800 8001 00                        h.x....

However, the client I'm working on can't decode this as a Document message. It reports the error "proto: wrong wireType = 5 for field Sections".

So I tried parsing the data with protoc, which also fails:

$ protoc --decode_raw < /tmp/output.bin
Failed to parse input.

It seems that the output is not valid protobuf data at all.

J38 commented 5 years ago

I think this should work in Java code:

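import java.io.InputStream;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer;
import edu.stanford.nlp.util.Pair;
// ...
ProtobufAnnotationSerializer serializer = new ProtobufAnnotationSerializer();
InputStream kis = IOUtils.getInputStreamFromURLOrClasspathOrFileSystem("/path/to/example.ser");
// read() returns the annotation together with the stream it was read from
Pair<Annotation, InputStream> pair = serializer.read(kis);
pair.second.close();
Annotation readAnnotation = pair.first;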

Also in Java code, shouldn't this work?

import java.io.InputStream;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.simple.Document;

InputStream kis = IOUtils.getInputStreamFromURLOrClasspathOrFileSystem("/path/to/example.ser");
Document myDocument = Document.deserialize(kis);

I'll try to test these out, but if you get a chance, let me know whether either of these solutions works...

If you want Python examples, please let me know...

jrmarkle commented 5 years ago

Thanks for the reply. I tried your second example and it worked, so I tried to figure out why. Document.deserialize uses the generated parseDelimitedFrom method instead of parseFrom. Likewise, Document.serialize uses writeDelimitedTo. The "delimited" versions put a varint holding the message length in front of every serialized message, which is useful for streaming messages over an open connection. For CoreNLPServer I don't think this is necessary or desirable, since the message length is already provided in the HTTP response header.
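To make the "delimited" framing concrete, here is a minimal Go sketch that uses only the standard library. The names writeDelimited, readDelimited and roundTrip are hypothetical helpers invented for this illustration, not part of CoreNLP or of any protobuf library, and plain strings stand in for serialized messages. It shows why the prefix matters on a stream: the reader needs it to know where one message ends and the next begins.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeDelimited writes payload preceded by its length encoded as a varint.
// (Hypothetical helper; it imitates the framing described above.)
func writeDelimited(w io.Writer, payload []byte) error {
	var prefix [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(prefix[:], uint64(len(payload)))
	if _, err := w.Write(prefix[:n]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readDelimited reads one length-prefixed payload. It returns io.EOF when the
// stream ends cleanly before a new prefix starts. A real client would also
// want to cap size before allocating.
func readDelimited(r *bufio.Reader) ([]byte, error) {
	size, err := binary.ReadUvarint(r)
	if err != nil {
		return nil, err
	}
	payload := make([]byte, size)
	if _, err := io.ReadFull(r, payload); err != nil {
		return nil, err
	}
	return payload, nil
}

// roundTrip frames every message into one buffer and reads them all back.
func roundTrip(msgs ...string) ([]string, error) {
	var stream bytes.Buffer
	for _, m := range msgs {
		if err := writeDelimited(&stream, []byte(m)); err != nil {
			return nil, err
		}
	}
	r := bufio.NewReader(&stream)
	var out []string
	for {
		payload, err := readDelimited(r)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		out = append(out, string(payload))
	}
}

func main() {
	msgs, err := roundTrip("first", "second message")
	if err != nil {
		panic(err)
	}
	for _, m := range msgs {
		fmt.Printf("%d bytes: %q\n", len(m), m)
	}
}
```

With a single message per HTTP response, as in the server case discussed here, only the first readDelimited call would be needed.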

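Apparently the Java code generator creates writeDelimitedTo and parseDelimitedFrom but the Go generator does not. It is easy enough to do myself now that I know it needs to be done:

    // The server response starts with a varint holding the message length.
    fileSize, varintSize := proto.DecodeVarint(fileData)
    if varintSize == 0 {
        t.Fatal("response does not start with a valid varint")
    }
    t.Log(fileSize)
    t.Log(varintSize)

    // Skip the length prefix and decode the rest as a Document.
    var document Document
    err = proto.Unmarshal(fileData[varintSize:], &document)
    if err != nil {
        t.Fatal(err)
    }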

In my example data above the first two bytes are a varint with value 453 and the remaining 453 bytes are a properly-encoded Document.
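The arithmetic behind that can be checked with a short standard-library Go sketch. The bytes are copied from the start of the hex dump earlier in this issue; lengthPrefix is a hypothetical helper name used only for this illustration. The sketch decodes the two-byte prefix and then looks at the byte that follows it, which reads as an ordinary protobuf tag (field 1, length-delimited) followed by the length 44 of the input sentence.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// lengthPrefix decodes the varint at the start of data and returns its value
// and the number of bytes it occupies. (Hypothetical helper for illustration.)
func lengthPrefix(data []byte) (size uint64, n int, err error) {
	size, n = binary.Uvarint(data)
	if n <= 0 {
		return 0, 0, errors.New("data does not start with a valid varint")
	}
	return size, n, nil
}

func main() {
	// First bytes of /tmp/output.bin, copied from the hex dump in this issue.
	head := []byte{0xc5, 0x03, 0x0a, 0x2c, 0x74, 0x68, 0x65}
	const fileSize = 455 // total size reported for the response

	size, n, err := lengthPrefix(head)
	if err != nil {
		panic(err)
	}
	// 0xc5: continuation bit set, low 7 bits = 69.
	// 0x03: last byte, contributes 3 << 7 = 384. 69 + 384 = 453.
	fmt.Println(size, n, int(size)+n == fileSize) // prints: 453 2 true

	// The byte right after the prefix is a regular protobuf tag:
	// field number = tag >> 3, wire type = tag & 7, then the string length.
	tag := head[n]
	fmt.Println(tag>>3, tag&7, head[n+1]) // prints: 1 2 44
}
```

Read without skipping the prefix, the same two bytes 0xc5 0x03 would be taken as a tag with wire type 5 (453 & 7), which may be related to the wire type in the client error above; the exact path the Go decoder took is not verified here.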

The documentation states that "edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer Writes the output to a protocol buffer, as defined in the definition file edu.stanford.nlp.pipeline.CoreNLP.proto." I think this should also state that it uses the delimited form which includes a varint before the Document message itself. I think most people would not expect that.