Open LeoFrom opened 5 months ago
Hello @LeoFrom. Thank you for providing the text file, that was helpful. I just tried it on the web version and it seems to be working for me. Could you kindly provide some more information?
I'd like to get as close as possible to your setup.
Hello, thank you for your reply.
Maybe this will help: it was functional at the beginning, but it seems that when I play with the text separator or switch texts, it sometimes freezes the tagging process. I'll include two more texts so you can maybe recreate the bug.
Thanks! I've managed to replicate it now. I'll take a look at why this is happening and try to get back to you soon.
It seems the issue occurs if there are double quotes (") inside the text.
The Treebank Tokenizer we use is a JavaScript port of the one used by the NLTK Python library. The issue was reported (and fixed) by the NLTK team, so it seems we need to update the port.
@tecoholic it would be great if you could help with this one since I haven't looked into your port yet. I could try to open a PR if I can figure out what needs to be changed there.
In the meantime, @LeoFrom, if it's an acceptable solution for your use case, you could try replacing all the double quotes with single quotes. I checked locally and it seemed to work alright.
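If it helps, the replacement can be done in bulk before loading the files. This is only a minimal sketch of the workaround, not part of the app: the function names `replace_double_quotes` and `clean_file` and the file paths are made up for illustration, and it assumes the files are UTF-8 and that only the straight double quote (") needs replacing.

```python
from pathlib import Path


def replace_double_quotes(text: str) -> str:
    """Return the text with every straight double quote turned into a single quote."""
    return text.replace('"', "'")


def clean_file(src: str, dst: str) -> None:
    """Write a copy of src to dst with double quotes replaced.

    Assumes UTF-8 encoding; adjust if your files use something else.
    """
    original = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(replace_double_quotes(original), encoding="utf-8")
```

For example, `clean_file("input.txt", "input_clean.txt")` (hypothetical file names) writes a cleaned copy and leaves the original file untouched.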
@alvi-khan I will take a look.
@alvi-khan Looking at the code, it looks like those fixes are already in the JS ported version. To confirm, I added the unit tests from the Python version and they are passing as expected. So, I think the issue might be elsewhere and not in the tokenizer.
Selecting text in some .txt files does not annotate the selected text, in either the Windows or the web application.
Adding the text for testing purposes: 01.01.01.01.199.txt