Open leonkunert opened 2 years ago
I think the current format that spaCy uses for NER data is DocBin. I don't know if there is an open spec that would allow reading and writing this format. Maybe reading the spaCy code will help.
Either way, I don't see a big need for msgpack.
The DocBin format is gzipped MsgPack: https://spacy.io/api/docbin
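A rough sketch of what that implies for a reader, using only the Python standard library: decompress the payload, then check that the first byte is a MsgPack map marker and read the entry count. The helper name `peek_docbin` is hypothetical, and accepting both zlib and gzip headers (`wbits=47`) is an assumption, since the docs only say "gzipped". The sample payload is hand-built for illustration, not real spaCy output.

```python
import zlib


def peek_docbin(payload: bytes) -> int:
    """Decompress a payload and return the entry count of its top-level MsgPack map.

    Hypothetical helper: it only inspects the map header, it does not decode the body.
    """
    # Assumption: wbits=47 (32 + 15) lets zlib auto-detect a zlib or gzip header.
    raw = zlib.decompress(payload, wbits=47)
    if not raw:
        raise ValueError("empty payload")
    marker = raw[0]
    if 0x80 <= marker <= 0x8F:
        # fixmap: the low 4 bits hold the number of key/value pairs
        return marker & 0x0F
    if marker == 0xDE:
        # map 16: a 2-byte big-endian count follows the marker
        return int.from_bytes(raw[1:3], "big")
    if marker == 0xDF:
        # map 32: a 4-byte big-endian count follows the marker
        return int.from_bytes(raw[1:5], "big")
    raise ValueError(f"top-level object is not a MsgPack map (marker 0x{marker:02x})")


# Hand-built MsgPack for {"version": "0.1"}:
# fixmap with 1 entry, fixstr(7) "version", fixstr(3) "0.1"
sample_raw = b"\x81" + b"\xa7" + b"version" + b"\xa3" + b"0.1"
sample = zlib.compress(sample_raw)
entry_count = peek_docbin(sample)
```

A full reader would hand the decompressed bytes to a MsgPack decoder instead of stopping at the header.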
@leonkunert Ah, I should have RTFD. Thanks for pointing that out. Then this is something that should definitely be implemented.
The tokens, spaces and lengths fields can be difficult: they are serialized numpy arrays.
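To get a feel for how hard those fields would be, here is a minimal sketch of decoding such raw buffers without numpy. It assumes the arrays are plain C-order byte dumps: uint64 rows of shape (n_tokens, n_attrs) for tokens, int32 for lengths, and one byte per boolean for spaces, all little-endian. Those dtypes and the byte order are assumptions to check against the spaCy docs and code, and the function names are hypothetical.

```python
import struct
from typing import List, Tuple


def decode_tokens(buf: bytes, n_attrs: int) -> List[Tuple[int, ...]]:
    """Split a raw buffer into rows of n_attrs unsigned 64-bit integers."""
    if n_attrs <= 0:
        raise ValueError("n_attrs must be positive")
    row = struct.Struct(f"<{n_attrs}Q")  # assumption: little-endian uint64
    if len(buf) % row.size:
        raise ValueError("buffer length is not a multiple of the row size")
    return [row.unpack_from(buf, offset) for offset in range(0, len(buf), row.size)]


def decode_lengths(buf: bytes) -> List[int]:
    """Read a raw buffer as a flat list of signed 32-bit integers."""
    if len(buf) % 4:
        raise ValueError("buffer length is not a multiple of 4")
    return list(struct.unpack(f"<{len(buf) // 4}i", buf))  # assumption: little-endian int32


def decode_spaces(buf: bytes) -> List[bool]:
    """Read a raw buffer as one boolean per byte."""
    return [byte != 0 for byte in buf]


# Made-up buffers for illustration: two tokens with two attributes each,
# two document lengths, and three trailing-space flags.
tokens = decode_tokens(struct.pack("<4Q", 1, 2, 3, 4), n_attrs=2)
lengths = decode_lengths(struct.pack("<2i", 3, 5))
spaces = decode_spaces(b"\x01\x00\x01")
```

If the layout assumptions hold, the arrays are just fixed-width integers, so a port to another language would not need a numpy dependency.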
We should try to reimplement the MsgPack format from spaCy. https://msgpack.org/ should be helpful. Maybe also implement import.