lasigeBioTM / MER

Minimal Named-Entity Recognizer (MER)
http://labs.fc.ul.pt/mer/

Annotation offsets should be character offsets, not byte offsets #18

Closed LLCampos closed 7 years ago

LLCampos commented 7 years ago

Currently, the annotation offsets returned by IBELight are byte offsets, not character offsets. See this explanation of the difference between the two.

They should be character offsets, since that is what is usually used in this type of task. The CEMP task used character offsets, and BioPortal Annotator uses them as well.

Example of difference:

bash get_entities.sh 1 A "‘ oxygen" ChEBI

Byte offset (currently)

1 A 4 10 0.441889 oxygen unknown 1

Character offset (our goal)

1 A 2 8 0.441889 oxygen unknown 1

The difference arises because the symbol ‘ counts as one character but as more than one byte (three bytes in UTF-8).
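To make the two numbers concrete, here is a minimal bash sketch that reproduces both offset pairs for the example text. The helper names `byte_len` and `char_len` are hypothetical and are not part of MER. Characters are counted by dropping UTF-8 continuation bytes, so the result should not depend on the active locale.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not taken from get_entities.sh.

# Number of bytes in a string.
byte_len() {
  printf '%s' "$1" | LC_ALL=C wc -c | tr -d ' '
}

# Number of UTF-8 characters: delete continuation bytes (0x80-0xBF)
# and count the bytes that remain, one per character.
char_len() {
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | LC_ALL=C wc -c | tr -d ' '
}

# The example text "‘ oxygen", with the quote written as its UTF-8 bytes.
text=$'\xe2\x80\x98 oxygen'
term='oxygen'

# Everything before the first occurrence of the term
# (the sketch assumes the term does occur in the text).
prefix=${text%%"$term"*}

byte_start=$(byte_len "$prefix")
char_start=$(char_len "$prefix")
byte_end=$(( byte_start + $(byte_len "$term") ))
char_end=$(( char_start + $(char_len "$term") ))

echo "byte offsets: $byte_start $byte_end"
echo "char offsets: $char_start $char_end"
```

For the example text this prints `4 10` for the byte offsets and `2 8` for the character offsets, matching the two annotation lines above.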

fjmc commented 7 years ago

Converted original_text to ASCII:

original_text=$(iconv -f utf-8 -t ascii//TRANSLIT <<< $3)
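The effect of that conversion can be sketched as follows. The variable `input` is a hypothetical stand-in for the script's `$3`. Which ASCII replacement is chosen for each symbol depends on the iconv implementation, so the sketch does not assume a specific one.

```shell
#!/usr/bin/env bash
# Hypothetical input standing in for "$3" in get_entities.sh:
# the example text "‘ oxygen", with the quote written as its UTF-8 bytes.
input=$'\xe2\x80\x98 oxygen'

# Transliterate to ASCII, as in the comment above.
ascii_text=$(iconv -f utf-8 -t ascii//TRANSLIT <<< "$input")

# In pure ASCII every character is exactly one byte,
# so a byte offset and a character offset are the same number.
prefix=${ascii_text%%oxygen*}
start=$(printf '%s' "$prefix" | LC_ALL=C wc -c | tr -d ' ')

echo "transliterated text: $ascii_text"
echo "start offset of oxygen: $start"
```

One caveat: some symbols transliterate to several ASCII characters (with glibc, for instance, `…` becomes `...`), so offsets computed on the converted text match character offsets in the original only when each symbol maps to a single character.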