Annotator not annotating some files

tecoholic / ner-annotator

Named Entity Recognition (NER) Annotation tool for SpaCy. Generates Training Data as JSON which can be readily used.

https://tecoholic.github.io/ner-annotator/

MIT License

560 stars, 165 forks

Annotator not annotating some files #111

Open: LeoFrom opened 5 months ago

LeoFrom commented 5 months ago

Selecting text in some .txt files does not annotate the selected text, in either the Windows application or the web application.

Adding the text file for testing purposes: 01.01.01.01.199.txt

alvi-khan commented 5 months ago

[Screenshot: NER Annotator for SpaCy]

Hello @LeoFrom. Thank you for providing the text file; that was helpful. I just tried it on the web version and it seems to be working for me. Could you kindly provide some more information?

What are you using as the text separator?
What are you using as the annotation precision?
If you're okay with doing so, please provide the tags file.

I'd like to get as close as possible to your setup.

LeoFrom commented 5 months ago

Hello, thank you for your reply.

I'm using --- as my text separator
My annotation precision is word level
Here is my tags file: Tags.json

Maybe this can help you: it was functional at the beginning, but it seems that when I change the text separator or switch texts, tagging sometimes freezes. I'll include two more texts so you can maybe recreate the bug.

01.01.01.01.55.txt 01.01.01.01.27.txt

alvi-khan commented 5 months ago

Thanks! I've managed to replicate it now. I'll take a look at why this is happening and try to get back to you soon.

alvi-khan commented 5 months ago

It seems the issue occurs if there are double quotes (") inside the text.

The Treebank Tokenizer we use is a JavaScript port of the one used by the NLTK Python library. The issue was reported (and fixed) by the NLTK team, so it seems we need to update the port.
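For readers unfamiliar with why double quotes can trip up a Treebank-style tokenizer, here is a toy sketch of the general mechanism. It is not NLTK's code and not the JavaScript port used by ner-annotator; `toy_tokenize`, `align`, and `restore_quotes` are hypothetical helpers written only for illustration. The assumption is the usual Penn Treebank convention of rewriting `"` as two backticks (opening) or two single quotes (closing), which means the rewritten tokens can no longer be found in the original text when computing character offsets. Whether this is what actually happens in the annotator is not established in this thread.

```python
import re

def toy_tokenize(text):
    """Toy stand-in for a Treebank-style tokenizer (illustration only).

    Splits into words and punctuation, and rewrites each straight double
    quote as `` (opening) or '' (closing), alternating between the two.
    """
    tokens = []
    opening = True
    for piece in re.findall(r'"|\w+|[^\w\s]', text):
        if piece == '"':
            tokens.append("``" if opening else "''")
            opening = not opening
        else:
            tokens.append(piece)
    return tokens

def align(tokens, text):
    """Map each token to (start, end) offsets by searching left to right."""
    spans, pos = [], 0
    for tok in tokens:
        start = text.find(tok, pos)
        if start == -1:
            raise ValueError(f"token {tok!r} not found in text")
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans

def restore_quotes(tokens):
    """Undo the quote rewrite so tokens match the original text again."""
    return ['"' if tok in ("``", "''") else tok for tok in tokens]

text = 'He said "hello" today.'
tokens = toy_tokenize(text)

try:
    align(tokens, text)          # fails: `` does not occur in the text
except ValueError as err:
    print("alignment failed:", err)

print(align(restore_quotes(tokens), text))  # succeeds once quotes are restored
```

In this sketch, alignment only succeeds after the rewritten quote tokens are mapped back to the characters that actually appear in the input. A word-level annotation tool depends on such offsets to know which characters a selection covers.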

@tecoholic it would be great if you could help with this one since I haven't looked into your port yet. I could try to open a PR if I can figure out what needs to be changed there.

In the meantime, @LeoFrom, if it's an acceptable solution for your use case, you could try replacing all the double quotes with single quotes. I checked locally and it seemed to work alright.
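The workaround above could be scripted roughly as follows. This is a minimal sketch, assuming the input files are UTF-8 and that only the straight double quote character (`"`) needs replacing; the function names and the file paths in the usage comment are hypothetical.

```python
from pathlib import Path

def replace_double_quotes(text: str) -> str:
    """Replace every straight double quote with a single quote."""
    return text.replace('"', "'")

def convert_file(src: str, dst: str) -> None:
    """Write a copy of src to dst with double quotes replaced.

    Assumes UTF-8 encoding (an assumption, adjust if your files differ).
    """
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(replace_double_quotes(text), encoding="utf-8")

# Hypothetical usage:
# convert_file("01.01.01.01.199.txt", "01.01.01.01.199.noquotes.txt")
```

Because each replacement swaps one character for one character, the text length and all character offsets stay the same. Writing to a separate output file keeps the original text untouched in case the quotes matter later.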

tecoholic commented 5 months ago

@alvi-khan I will take a look.

tecoholic commented 5 months ago

@alvi-khan Looking at the code, it looks like those fixes are already in the JS ported version. To confirm, I added the unit tests from the Python version and they are passing as expected. So, I think the issue might be elsewhere and not in the tokenizer.