lasigeBioTM / MER

Minimal Named-Entity Recognizer (MER)
http://labs.fc.ul.pt/mer/

get_entities should match whole words only #28

Closed LLCampos closed 7 years ago

LLCampos commented 7 years ago

When running:

bash get_entities.sh 1 T 'methanol ethanol' ChEBI

We get:

1   T   0   8   0.519102    methanol    ChEBI   1
1   T   1   8   0.486102    ethanol ChEBI   1
1   T   9   16  0.486102    ethanol ChEBI   1

We should get:

1   T   0   8   0.519102    methanol    ChEBI   1
1   T   9   16  0.486102    ethanol ChEBI   1
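The expected behaviour can be illustrated with a minimal Python sketch. This is not how `get_entities.sh` is implemented; the function name `find_whole_words` is hypothetical, and the sketch only assumes the 0-based, end-exclusive offsets seen in the output above.

```python
import re

def find_whole_words(text, term):
    """Return (start, end) spans where `term` occurs as a whole word in `text`.

    Hypothetical helper for illustration only; not part of MER.
    """
    # The lookarounds reject matches that have a word character directly
    # before or after them, so "ethanol" is not found inside "methanol".
    pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")
    return [m.span() for m in pattern.finditer(text)]

text = "methanol ethanol"
print(find_whole_words(text, "methanol"))  # [(0, 8)]
print(find_whole_words(text, "ethanol"))   # [(9, 16)]
```

With this kind of boundary check, the spurious `1 8` match for ethanol inside methanol does not appear.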

Note that running

bash get_entities.sh 1 T 'methanol' ChEBI

gives the correct output:

1 T 0 8 0.519102 methanol ChEBI 1

fjmc commented 7 years ago

fixed

$ ./get_entities.sh 1 T 'methanol ethanol' chebi
1       T       0       8       0.519102        methanol        chebi   1
1       T       2       9       0.486102        ethanol chebi   1

LLCampos commented 7 years ago

@fjmc, the indexes are wrong.

LLCampos commented 7 years ago

@fjmc

This is not solved. When we run:

bash get_entities.sh 1 T "chlorotomoxetine tomoxetine" ChEBI

We get:

1 T 7 17 0.565706 tomoxetine ChEBI 1

The indexes are wrong; it should be:

1 T 17 27 0.565706 tomoxetine ChEBI 1

This causes weird stuff to happen, like this:

bash get_entities.sh 1 T "chlorotomoxetine tomoxetine diethyl ether water" ChEBI

1   T   7   17  0.565706    tomoxetine  ChEBI   1
1   T   26  31  0.378665    ether   ChEBI   1
1   T   32  37  0.378665    water   ChEBI   1
1   T   28  41  0.610129    diethyl ether   ChEBI   1

(see the last two entries: they overlap)
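One quick way to spot offsets like these is to slice the input with each reported span and compare the result with the reported term. The sketch below is only an illustration; `check_spans` is a hypothetical helper, not part of MER, and it assumes 0-based, end-exclusive offsets.

```python
def check_spans(text, rows):
    """Return the (start, end, term) rows whose span does not slice out `term`.

    Hypothetical helper for illustration only; not part of MER.
    """
    return [(start, end, term) for start, end, term in rows
            if text[start:end] != term]

text = "chlorotomoxetine tomoxetine diethyl ether water"
# Offsets copied from the output reported above.
reported = [
    (7, 17, "tomoxetine"),
    (26, 31, "ether"),
    (32, 37, "water"),
    (28, 41, "diethyl ether"),
]
for start, end, term in check_spans(text, reported):
    print(f"{start}-{end}: expected {term!r}, found {text[start:end]!r}")
```

Only the `diethyl ether` row passes this check; the other three spans do not line up with the terms they claim to cover.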

fjmc commented 7 years ago

$ ./get_entities.sh 1 T "chlorotomoxetine tomoxetine diethyl ether water" chebi
1       T       17      27      0.565706        tomoxetine      chebi   1
1       T       36      41      0.378665        ether   chebi   1
1       T       42      47      0.378665        water   chebi   1
1       T       28      41      0.610129        diethyl ether   chebi   1