Closed LLCampos closed 7 years ago
Currently, the text that dictionary words are matched against is converted to ASCII. This is a problem if the text contains Unicode characters that are supposed to match Unicode characters in the dictionary, since the output would be in ASCII.
@fjmc any ideas or bash tricks?
I've created a SO question about this, maybe there is a really easy way that we're missing.
this seems to be the best solution:
awk '{print index($0,"amil")-1}'<<< "α-amilase"
2
awk '{print index($0,"amil")-1}'<<< "fooamilα-whatever"
3
just replace the last grep with awk
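Note that index() only reports the first occurrence on each line. A minimal sketch of how every occurrence could be listed with awk alone is below; the function name all_match_offsets is hypothetical and is not part of get_entities.sh. Whether the offsets count characters or bytes depends on the awk implementation and the locale (for example, gawk in a UTF-8 locale counts characters, while a byte-oriented awk counts bytes).

```shell
# Hypothetical helper: print the 0-based offset of every occurrence of a
# fixed string in a text, one offset per line.
# $1 = needle (fixed string), $2 = text
# Caveat: awk -v processes backslash escapes in the needle.
all_match_offsets() {
  printf '%s\n' "$2" | awk -v needle="$1" '{
    base = 0          # how much of the line has already been consumed
    rest = $0
    while ((i = index(rest, needle)) > 0) {
      print base + i - 1            # convert 1-based index to 0-based offset
      base += i                     # skip past the first char of this match
      rest = substr(rest, i + 1)    # keep searching (overlaps are allowed)
    }
  }'
}

all_match_offsets "amil" "fooamil amil"
```

With the plain ASCII input above, the two occurrences are reported at offsets 3 and 8.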
I will try it
The problem with awk is that it only gives one match per line. Instead, I'll find the characters that are not ASCII and decrement the offsets accordingly:
$ grep -iaob "[^ -~]" <<< "‘ α-amilwhatever"
0:‘
4:α
A new function, get_matches_positions, was created.
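The decrement idea can be sketched as follows. This is only an illustration under the assumption that the text is valid UTF-8; the helper name byte_to_char_offset is hypothetical and is not the project's get_matches_positions. It relies on the fact that every UTF-8 continuation byte lies in the range 0x80-0xBF, so the number of characters before a byte offset equals the number of bytes in that prefix that are not continuation bytes.

```shell
# Hypothetical helper: convert a byte offset (as printed by grep -b)
# into a character offset, assuming the text is valid UTF-8.
# $1 = text, $2 = byte offset into the text
byte_to_char_offset() {
  if [ "$2" -eq 0 ]; then
    echo 0
    return
  fi
  # Take the first $2 bytes, drop UTF-8 continuation bytes (octal 200-277),
  # and count what is left: one byte per character.
  printf '%s' "$1" | head -c "$2" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

byte_to_char_offset "‘ α-amilwhatever" 4
```

In the grep output above, α is reported at byte 4, but only two characters (the quote and the space) precede it, so the character offset is 2.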
Fix causes a lot of tests to fail.
Fixed:
$ ./get_entities.sh 1 A "α-amilase α-amilase α-amilaseα-amilase α-amilase" chebi
1 A 0 9 0.54488 α-amilase chebi 1
1 A 10 19 0.54488 α-amilase chebi 1
1 A 39 48 0.54488 α-amilase chebi 1
This causes issues #10, #2 and #13 to reappear.
Fixed #2 and #13.
I'm going to close this since the problem now is in #10.
When running (alpha-amylase is a vocabulary containing just the word "α-amilase" created for testing purposes):
bash get_entities.sh 1 T "α-amilase" alpha-amylase
We don't get anything, but we should get:
1 T 0 9 0.544880 α-amilase unknown 1