Ambiguity in indexing of labels

tecoholic / ner-annotator

Named Entity Recognition (NER) Annotation tool for SpaCy. Generates Training Data as a JSON which can be readily used.

https://tecoholic.github.io/ner-annotator/

MIT License

549 stars, 163 forks

Ambiguity in indexing of labels #21

Closed Faran-Javaid closed 2 years ago

Faran-Javaid commented 2 years ago

Hi @tecoholic. First of all, thanks for this great repository. Secondly, I would like to ask a question: can you please explain which string you are indexing against (the original text or the text after tokenization)? When I test the exported JSON file, the indices do not line up with the labelled entities.

tecoholic commented 2 years ago

@Faran-Javaid Hi, the indices are calculated as per the TreebankTokenizer algorithm. The relevant code can be seen here:

https://github.com/tecoholic/ner-annotator/blob/main/annotator/server.py#L20

rsparth commented 2 years ago

This index issue happened to me as well. If I index into the original text, the indices work fine, but if I index into the text that is present in the JSON, I get the wrong result.
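A minimal sketch of how this kind of mismatch can arise. It assumes the text stored in the JSON has had its whitespace collapsed, which the thread does not confirm; the sample string and the normalization step are illustrative only and are not taken from the annotator's code.

```python
# Sample text with a double space, a tab, and a newline (illustrative only).
original = "Alice  met\tBob\nin   Paris"

# Assumed normalization: split on any whitespace and rejoin with single spaces.
normalized = " ".join(original.split())

# Character offsets of "Bob", measured on the original string.
start = original.index("Bob")
end = start + len("Bob")

in_original = original[start:end]      # offsets applied to the string they came from
in_normalized = normalized[start:end]  # same offsets applied to the normalized string

print(repr(in_original))
print(repr(in_normalized))
```

Offsets are only meaningful for the exact string they were computed on. Once runs of whitespace are shortened, every later character shifts left, so the same offsets select a different slice.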

Faran-Javaid commented 2 years ago

Thanks for the response @tecoholic. I have noticed that this tool works completely fine when the text is properly formatted. However, if the string contains multiple consecutive spaces or \t or \n characters, the indexing goes wrong. I have fixed this issue by computing the annotations against the original text instead of the tokenized text. Please let me know if you want me to make a pull request for these changes. Cheers!
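The general idea of indexing against the original text can be sketched as below. This is not the code from the pull request: `token_spans` is a hypothetical helper, and a plain whitespace regex stands in for the Treebank tokenizer as a simplifying assumption.

```python
import re


def token_spans(text):
    """Return (start, end, token) triples whose offsets index the unmodified text."""
    # \S+ matches each run of non-whitespace characters; the match positions
    # refer to the input string itself, so spaces, tabs, and newlines are
    # skipped without shifting any offsets.
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


text = "Alice  met\tBob\nin   Paris"
spans = token_spans(text)

# Every span slices back to its own token in the original string.
for start, end, token in spans:
    assert text[start:end] == token
```

Because the spans are taken from matches on the input itself, exporting that same unmodified string alongside the offsets keeps the two consistent.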

tecoholic commented 2 years ago

@Faran-Javaid It would be nice to have the improvement included. Please send a PR; I will be happy to merge it. Thanks in advance.

tecoholic commented 2 years ago

PR & Further Discussion in #23