lasigeBioTM / MER

Minimal Named-Entity Recognizer (MER)

http://labs.fc.ul.pt/mer/


Unicode characters in vocabularies don't match unicode characters in text #19

Closed LLCampos closed 7 years ago

LLCampos commented 7 years ago

When running the following command (alpha-amylase is a vocabulary created for testing purposes, containing just the word "α-amilase"):

bash get_entities.sh 1 T "α-amilase" alpha-amylase

We don't get anything, but we should get:

1 T 0 9 0.544880 α-amilase unknown 1

LLCampos commented 7 years ago

Currently, the text that dictionary words are matched against is converted to ASCII. This is a problem when the text contains Unicode characters that are supposed to match Unicode characters in the dictionary: after the conversion those characters are no longer in the text, and the output would be in ASCII as well.
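A minimal sketch of the failure, assuming the conversion simply drops every non-ASCII byte. The `to_ascii` helper is hypothetical and only stands in for whatever MER's script actually does at that step:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the ASCII conversion step (an assumption,
# not MER's actual code): delete every byte outside the ASCII range.
to_ascii() {
  LC_ALL=C tr -cd '\000-\177'
}

text='α-amilase'   # input text
term='α-amilase'   # the single entry in the test vocabulary

# Convert the text the same way the matcher would see it.
ascii_text=$(printf '%s' "$text" | to_ascii)

# Fixed-string match of the vocabulary term against the converted text.
if printf '%s\n' "$ascii_text" | grep -qF -- "$term"; then
  converted_match=yes
else
  converted_match=no
fi

# Same match against the untouched text.
if printf '%s\n' "$text" | grep -qF -- "$term"; then
  original_match=yes
else
  original_match=no
fi

echo "converted text: '$ascii_text' (match: $converted_match)"
echo "original text:  '$text' (match: $original_match)"
```

The vocabulary term still contains "α", but the converted text does not, so the match can only succeed against the original text.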

@fjmc any ideas or bash tricks?

I've created a SO question about this, maybe there is a really easy way that we're missing.