jrmarkle opened this issue 5 years ago
I think this should work in Java:
// ...
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer;
import edu.stanford.nlp.util.Pair;
import java.io.InputStream;
// ...
ProtobufAnnotationSerializer serializer = new ProtobufAnnotationSerializer();
InputStream kis = IOUtils.getInputStreamFromURLOrClasspathOrFileSystem("/path/to/example.ser");
Pair<Annotation, InputStream> pair = serializer.read(kis);
pair.second.close();
Annotation readAnnotation = pair.first;
Also in Java, shouldn't this work:
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.simple.Document;
InputStream kis = IOUtils.getInputStreamFromURLOrClasspathOrFileSystem("/path/to/example.ser");
Document myDocument = Document.deserialize(kis);
I'll try to test these out, but if you get a chance let me know whether either of these solutions works...
If you want Python examples, please let me know...
Thanks for the reply. I tried your second example and it worked, so I tried to figure out why. Document.deserialize uses the generated parseDelimitedFrom method instead of parseFrom. Likewise, Document.serialize uses writeDelimitedTo. The "delimited" versions put a varint in front of every serialized message, which is useful for streaming messages over an open connection. For CoreNLPServer I don't think this is necessary or desirable, since the message length is already provided in the HTTP response header.
Apparently the Java code generator creates writeDelimitedTo and parseDelimitedFrom but the Go generator does not. It is easy enough to do myself now that I know it needs to be done:
// fileData holds the serialized output: a varint length prefix followed by the Document message.
fileSize, varintSize := proto.DecodeVarint(fileData)
t.Log(fileSize)   // 453: the length of the encoded Document message
t.Log(varintSize) // 2: the number of bytes the varint prefix occupies
var document Document
// Skip the varint prefix and unmarshal the remaining bytes as a Document.
err = proto.Unmarshal(fileData[varintSize:], &document)
if err != nil {
	t.Fatal(err)
}
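The write direction can be framed the same way. Here is a minimal sketch using the same github.com/golang/protobuf/proto package (still inside the test above, so t and document are the variables from the previous snippet); this only approximates what writeDelimitedTo does on the Java side, it is not a generated method:
// Rebuild the "delimited" framing: a varint length prefix followed by the encoded message.
data, err := proto.Marshal(&document)
if err != nil {
	t.Fatal(err)
}
framed := append(proto.EncodeVarint(uint64(len(data))), data...)
t.Log(len(framed)) // varint prefix bytes + message bytes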
In my example data above, the first two bytes are a varint with value 453 and the remaining 453 bytes are a properly-encoded Document.
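A quick sanity check of that prefix (same proto package as above; just a log line, not part of any API):
t.Logf("% x", proto.EncodeVarint(453)) // prints "c5 03", the first two bytes of the output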
The documentation states that "edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer Writes the output to a protocol buffer, as defined in the definition file edu.stanford.nlp.pipeline.CoreNLP.proto." I think this should also state that it uses the delimited form which includes a varint before the Document message itself. I think most people would not expect that.
I'm running the CoreNLP server and I sent it this request:
wget --post-data 'the quick brown fox jumped over the lazy dog' http://localhost:9000/?properties=%7B%22annotators%22%3A%22tokenize%2Cssplit%22%2C%22outputFormat%22%3A%22serialized%22%2C%22serializer%22%3A%22edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%22%7D -O /tmp/output.bin
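(For readability, the percent-encoded properties parameter in that request decodes to {"annotators":"tokenize,ssplit","outputFormat":"serialized","serializer":"edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer"}.)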
The output is 455 bytes and looks like it might be correct at first glance. However, the client I'm working on can't decode it as a Document message. It reports the error "proto: wrong wireType = 5 for field Sections". So I tried parsing the data with protoc, which also fails. It seems that the output is not valid protobuf data at all.
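In case it is useful to anyone else hitting this: until the server output changes, the client-side workaround is simply to strip the varint length prefix before unmarshalling, as worked out above. A rough Go sketch, assuming the Document type generated from CoreNLP.proto and the same request as the wget example; fetchDocument is just a made-up helper name and error handling is minimal:
import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
	"strings"

	"github.com/golang/protobuf/proto"
)

func fetchDocument(text string) (*Document, error) {
	props := url.QueryEscape(`{"annotators":"tokenize,ssplit","outputFormat":"serialized","serializer":"edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer"}`)
	resp, err := http.Post("http://localhost:9000/?properties="+props, "text/plain", strings.NewReader(text))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	// The server writes the delimited form: a varint message length, then the message itself.
	msgLen, varintLen := proto.DecodeVarint(body)
	if varintLen == 0 || int(msgLen) != len(body)-varintLen {
		return nil, fmt.Errorf("unexpected response framing (%d bytes)", len(body))
	}

	// Skip the varint prefix and decode the rest as a Document.
	var doc Document
	if err := proto.Unmarshal(body[varintLen:], &doc); err != nil {
		return nil, err
	}
	return &doc, nil
}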