lasigeBioTM / MER

Minimal Named-Entity Recognizer (MER)
http://labs.fc.ul.pt/mer/

Unicode characters in vocabularies don't match unicode characters in text #19

Closed LLCampos closed 7 years ago

LLCampos commented 7 years ago

When running (alpha-amylase is a vocabulary containing just the word "α-amilase" created for testing purposes):

bash get_entities.sh 1 T "α-amilase" alpha-amylase

We don't get anything, but we should get:

1 T 0 9 0.544880 α-amilase unknown 1

LLCampos commented 7 years ago

Currently, the text that dictionary words are matched against is converted to ASCII. This is a problem when the text contains Unicode characters that are supposed to match Unicode characters in the dictionary, since the output of the conversion would be in ASCII.

@fjmc any ideas or bash tricks?

I've created a SO question about this, maybe there is a really easy way that we're missing.

fjmc commented 7 years ago

This seems to be the best solution:
$ awk '{print index($0,"amil")-1}' <<< "α-amilase"
2
$ awk '{print index($0,"amil")-1}' <<< "fooamilα-whatever"
3
It just means replacing the last grep with an awk. I will try it.

fjmc commented 7 years ago

The problem with awk is that it only gives one match per line. Instead, I will look for the characters that are not ASCII and decrement the offsets:
$ grep -iaob "[^ -~]" <<< "‘ α-amilwhatever"
0:‘
4:α
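A minimal sketch of the decrement idea above, assuming the text is UTF-8 and fits on a single line. The helper names byte_to_char_offset and match_char_offsets are hypothetical and are not MER's get_matches_positions. Each byte offset reported by grep -b is turned into a character offset by discarding the UTF-8 continuation bytes that precede the match.

```shell
#!/bin/sh
# Hypothetical helper (not part of MER): convert a byte offset into a
# character offset, assuming the text is valid UTF-8.
byte_to_char_offset() {
    # $1 = text, $2 = byte offset as reported by `grep -b`
    # Keep the first $2 bytes, delete UTF-8 continuation bytes (0x80-0xBF),
    # and count what remains: exactly one byte per character.
    printf '%s' "$1" | head -c "$2" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

# Hypothetical helper: print the character offset of every occurrence of
# the fixed string $2 in the single-line text $1, one offset per line.
match_char_offsets() {
    printf '%s\n' "$1" | LC_ALL=C grep -ob -F -- "$2" | cut -d: -f1 |
    while read -r off; do
        byte_to_char_offset "$1" "$off"
    done
}

# Example call: both occurrences of "amil" are reported.
match_char_offsets "α-amilase α-amilase" "amil"
```

For the example call, grep reports byte offsets 3 and 14, and the sketch prints the character offsets 2 and 12. Unlike the awk index approach, every match on the line is reported.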

fjmc commented 7 years ago

A new function, get_matches_positions, was created.

LLCampos commented 7 years ago

Fix causes a lot of tests to fail.

fjmc commented 7 years ago

Fixed:
$ ./get_entities.sh 1 A "α-amilase α-amilase α-amilaseα-amilase α-amilase" chebi
1 A 0 9 0.54488 α-amilase chebi 1
1 A 10 19 0.54488 α-amilase chebi 1
1 A 39 48 0.54488 α-amilase chebi 1

LLCampos commented 7 years ago

This causes issues #10, #2 and #13 to reappear.

fjmc commented 7 years ago

Fixed #2 and #13.

#10: still to decide what to do.

LLCampos commented 7 years ago

I'm going to close this since the problem now is in #10.