lasigeBioTM / MER

Minimal Named-Entity Recognizer (MER)
http://labs.fc.ul.pt/mer/

Wrong indexes #11

Closed LLCampos closed 7 years ago

LLCampos commented 7 years ago

When running:

bash get_entities.sh CN101084906A A "The invention discloses a new use of phenanthridine biologic alcali of general formula I containing coumarone and its derivant in preparing anti-hepatitis b virus medicament.The biologic alcali has tumor cell apoptosis inducing effect, has strong inhibiting activity against to such pathogen, has antiviral effect, and is expected for treating liver cancer, wherein R1 to R10, R12 and R13 is hydrogen, hydroxy, carbon chain or naphthenic base with 1-12 carbon atoms, alkoxyl or acyloxy group, benzyloxy, chlorine and other halogen atoms, amino group, methylol, aldehyde group, carbonyl, acetonyl, carboxy, sulacyloxy, 4-methyl-benzenesulfonyloxyl, arylsulfonyloxy, diphenylphosphonoxyl, ‘ -OCONH2; R11 is hydrogen, methyl or oxygen atom; R14 and R15 are respectively hydrogen or methyl." ChEBI

In the result, the annotation corresponding to "oxygen" is:

CN101084906A A 727 733 0.441889 oxygen unknown 1

But in the gold standard, the corresponding annotation is:

CN101084906A A 725 731 oxygen SYSTEMATIC

There's a difference of 2 characters.

An IPython session that might help in understanding the problem:

In [1]: text = "The invention discloses a new use of phenanthridine biologic alcali of general formula I containing coumarone and its derivant in preparing anti-hepatitis b virus medicament.The biologic alcali has tumor cell apoptosis inducing effect, has strong inhibiting activity against to such pathogen, has antiviral effect, and is expected for treating liver cancer, wherein R1 to R10, R12 and R13 is hydrogen, hydroxy, carbon chain or naphthenic base with 1-12 carbon atoms, alkoxyl or acyloxy group, benzyloxy, chlorine and other halogen atoms, amino group, methylol, aldehyde group, carbonyl, acetonyl, carboxy, sulacyloxy, 4-methyl-benzenesulfonyloxyl, arylsulfonyloxy, diphenylphosphonoxyl, ‘ -OCONH2; R11 is hydrogen, methyl or oxygen atom; R14 and R15 are respectively hydrogen or methyl."

In [2]: text[725:731]
Out[2]: 'r oxyg'

In [3]: text[727:733]
Out[3]: 'oxygen'

In [4]: text.decode('utf-8')[725:731]
Out[4]: u'oxygen'
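
The same discrepancy can be reproduced in Python 3 on a shortened string. This is only a minimal sketch: the sample text and variable names are illustrative and not taken from MER. It assumes the text is UTF-8 encoded, in which the left single quotation mark (U+2018) occupies three bytes, so any offset after it is 2 larger when counted in bytes than when counted in characters.

```python
# Shortened, illustrative sample with one non-ASCII character (U+2018)
# placed before the entity of interest.
sample = "diphenylphosphonoxyl, \u2018 -OCONH2; R11 is hydrogen, methyl or oxygen atom"

# Offset counted in characters (what the gold standard uses).
char_start = sample.index("oxygen")

# Offset counted in UTF-8 bytes (what a byte-oriented tool would report).
raw = sample.encode("utf-8")
byte_start = raw.index(b"oxygen")

# Width of the quotation mark once encoded.
quote_width = len("\u2018".encode("utf-8"))

print("character offset:", char_start)
print("byte offset:     ", byte_start)
print("shift:           ", byte_start - char_start)
```

The shift equals the encoded width of the quotation mark minus one, which matches the 2-character difference reported above.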

LLCampos commented 7 years ago

The non-ASCII character ‘ (three bytes in UTF-8, the first being '\xe2') was being counted as more than one character, which shifted every index after it by 2. The fix consists of replacing this and all other special characters with a default "." in the text that is used for the final matching (previously called "original_text").
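
The idea behind the fix can be sketched in Python 3. This is not MER's actual code, which is not shown in this thread: `mask_non_ascii` is a hypothetical helper, and the sample string is shortened for illustration. Replacing each non-ASCII character with a single "." makes every character exactly one byte wide, so byte-based and character-based offsets agree.

```python
import re


def mask_non_ascii(text, placeholder="."):
    """Replace every non-ASCII character with a single ASCII placeholder.

    Hypothetical helper for illustration; not part of MER.
    """
    return re.sub(r"[^\x00-\x7F]", placeholder, text)


# Shortened, illustrative sample containing U+2018 before the entity.
sample = "phosphonoxyl, \u2018 -OCONH2; methyl or oxygen atom"
masked = mask_non_ascii(sample)

# Character offset in the original text (gold-standard convention).
char_start = sample.index("oxygen")

# Byte offset in the masked text: the text is now pure ASCII,
# so one character corresponds to exactly one byte.
byte_start = masked.encode("utf-8").index(b"oxygen")

print(masked)
print("character offset:", char_start)
print("byte offset after masking:", byte_start)
```

Because the replacement is one-for-one, the masked text has the same number of characters as the original, so offsets found in it can be reported against the original text unchanged.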