nlplab / nersuite

http://nersuite.nlplab.org/

nersuite_tokenizer produces wrong results on wide character types (Unicode, UTFs) #25

Open · fnl opened 10 years ago

fnl commented 10 years ago

There is a Unicode bug in nersuite_common/tokenizer.cpp, in Tokenizer::find_token_end: if the isalnum(int) test inside that method fails, the token created is always exactly one byte wide (the method returns beg + 1, where beg is a size_t). As a result, multibyte-encoded text, such as any UTF encoding, cannot be tokenized correctly by this tool (with the logical exception of UTF-8 text that contains only ASCII characters), because characters wider than one byte are split into two or more tokens. This is especially nasty for UTF-8 encoded text, because the bug is non-obvious and only becomes apparent when characters like non-ASCII dashes or Greek letters are present in the text.
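For illustration, here is a minimal sketch of the fix I have in mind (this is not the actual NERsuite code; the names utf8_char_len and find_token_end_utf8 are made up for this example). The non-alphanumeric branch skips a whole UTF-8 sequence, determined from the lead byte, instead of a single byte:

```cpp
#include <cctype>
#include <string>

// Number of bytes in the UTF-8 sequence starting with this lead byte.
size_t utf8_char_len(unsigned char lead) {
    if (lead < 0x80)        return 1;  // ASCII
    if ((lead >> 5) == 0x6) return 2;  // 110xxxxx -> 2-byte sequence
    if ((lead >> 4) == 0xE) return 3;  // 1110xxxx -> 3-byte sequence
    if ((lead >> 3) == 0x1E) return 4; // 11110xxx -> 4-byte sequence
    return 1;                          // invalid lead byte: fall back to 1
}

// Sketch of a UTF-8-aware token-end search; assumes beg < text.size().
size_t find_token_end_utf8(const std::string& text, size_t beg) {
    unsigned char c = static_cast<unsigned char>(text[beg]);
    if (!std::isalnum(c)) {
        // Emit the whole multibyte character as one token, instead of
        // splitting it byte by byte (the reported bug: "return beg + 1").
        return beg + utf8_char_len(c);
    }
    // Otherwise consume the alphanumeric run as before.
    size_t end = beg;
    while (end < text.size() &&
           std::isalnum(static_cast<unsigned char>(text[end]))) {
        ++end;
    }
    return end;
}
```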

priancho commented 10 years ago

Thank you for your bug report.

At the beginning of developing this application, we used a preprocessing program that converts Unicode characters to ASCII characters. It is not exactly the same program, but you can find a similar one at https://github.com/spyysalo/unicode2ascii
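The idea of that preprocessing step is roughly the following (an illustrative sketch only, not the actual unicode2ascii tool; the to_ascii function and its tiny mapping table are hypothetical): replace known non-ASCII code points, such as dashes and Greek letters, with ASCII approximations before tokenization.

```cpp
#include <map>
#include <string>

// Hypothetical sketch: map a few UTF-8 byte sequences to ASCII
// replacements; everything else is copied through unchanged.
std::string to_ascii(const std::string& utf8_text) {
    static const std::map<std::string, std::string> table = {
        {"\xE2\x80\x93", "-"},  // U+2013 en dash
        {"\xE2\x80\x94", "-"},  // U+2014 em dash
        {"\xCE\xB1", "alpha"},  // U+03B1 Greek small letter alpha
        {"\xCE\xB2", "beta"},   // U+03B2 Greek small letter beta
    };
    std::string out;
    for (size_t i = 0; i < utf8_text.size(); ) {
        bool replaced = false;
        for (const auto& kv : table) {
            if (utf8_text.compare(i, kv.first.size(), kv.first) == 0) {
                out += kv.second;
                i += kv.first.size();
                replaced = true;
                break;
            }
        }
        if (!replaced) out += utf8_text[i++];
    }
    return out;
}
```

With such a step in front of the pipeline, the tokenizer only ever sees single-byte ASCII characters, which is why the bug does not surface in that setup.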

I would also like to make NERsuite handle multibyte input, since non-ASCII characters appear in virtually all biomedical texts. Unfortunately, it will take some time, at least a few months, before I can find time for this improvement, because I am currently preparing my thesis defense presentation, which will take place at the beginning of February.

Best wishes,

fnl commented 10 years ago

Thanks for the reply and the link to a (hopefully temporary) workaround. Good luck with the defense!