tecoholic / ner-annotator

Named Entity Recognition (NER) annotation tool for spaCy. Generates training data as JSON that can be readily used.
https://tecoholic.github.io/ner-annotator/
MIT License

Generate MsgPack export/import #50

Open leonkunert opened 2 years ago

leonkunert commented 2 years ago

We should try to reimplement the MsgPack format that spaCy uses. https://msgpack.org/ should be helpful. Maybe also implement import.

tecoholic commented 2 years ago

I think the current format that spaCy uses for NER data is DocBin. I don't know if there is an open spec that would allow reading and writing this format. Maybe reading the spaCy code will help.

Either way, I don't see a big need for msgpack.

leonkunert commented 2 years ago

The DocBin format is gzipped MsgPack: https://spacy.io/api/docbin
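
A minimal sketch of the outer compression layer, using only the Python standard library. It rests on an assumption worth checking against the spaCy source: although the docs say "gzipped", the bytes may carry a zlib header rather than a gzip one, so the hypothetical helper `decompress_docbin_bytes` below accepts both. The MsgPack decoding step is deliberately left out.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decompress_docbin_bytes(raw: bytes) -> bytes:
    """Return the decompressed payload (expected to be MsgPack, not decoded here).

    Hypothetical helper: it assumes the file is either a gzip stream or a
    zlib-wrapped deflate stream, and picks the decoder from the header.
    """
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return zlib.decompress(raw)
```

An importer in the browser would need the same distinction, since a gzip decoder typically rejects a zlib-wrapped stream and vice versa.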

tecoholic commented 2 years ago

@leonkunert Ah, I should have RTFD. Thanks for pointing that out. Then this is something that should definitely be implemented.

leonkunert commented 2 years ago

The tokens, spaces and lengths fields can be difficult. They are serialized NumPy arrays.
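
A rough sketch of what decoding such fields could look like without NumPy. Everything about the layout here is an assumption to verify against the spaCy source: lengths as a flat little-endian int32 buffer, tokens as a C-order (row-major) little-endian uint64 buffer with one row per token and one column per attribute. The function names are hypothetical.

```python
import struct


def unpack_int32(raw):
    """Decode a contiguous little-endian int32 buffer (assumed layout of lengths)."""
    count, remainder = divmod(len(raw), 4)
    if remainder:
        raise ValueError("buffer length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{count}i", raw))


def unpack_uint64_rows(raw, n_cols):
    """Decode a row-major little-endian uint64 buffer into rows of n_cols values."""
    if n_cols <= 0:
        raise ValueError("n_cols must be positive")
    count, remainder = divmod(len(raw), 8)
    if remainder or count % n_cols:
        raise ValueError("buffer size does not match the expected row width")
    flat = struct.unpack(f"<{count}Q", raw)
    return [list(flat[i:i + n_cols]) for i in range(0, count, n_cols)]


def split_rows_by_doc(rows, lengths):
    """Cut the stacked token rows back into one list of rows per document."""
    docs, start = [], 0
    for n in lengths:
        docs.append(rows[start:start + n])
        start += n
    if start != len(rows):
        raise ValueError("lengths do not add up to the number of token rows")
    return docs
```

The lengths array is what makes the stacked tokens buffer recoverable: all documents share one flat array, and each entry in lengths says how many rows belong to the next document.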