ieeta-pt / Multi-Head-CRF

MIT License

Some entities do not have the text field on the output file from inference #3

Open T-Almeida opened 2 weeks ago

T-Almeida commented 2 weeks ago

I believe merge #2 introduced a bug where some entities are missing their associated text in the output file. Here's a comparison:

New output file:

es-S2254-28842014000200009-1    1346    1355    SYMPTOM -       focalidad
es-S2254-28842014000200009-1    1382    1401    SYMPTOM -       síntoma neurológico
es-S2254-28842014000200009-1    2271    2279    SYMPTOM -
es-S2254-28842014000200009-1    2740    2758    SYMPTOM -
es-S2254-28842014000200009-1    2777    2798    SYMPTOM -
es-S2254-28842014000200009-1    2893    2900    SYMPTOM -
es-S2254-28842014000200009-1    2922    2933    SYMPTOM -
es-S2254-28842014000200009-1    2936    2948    SYMPTOM -
es-S2340-98942015000100005-1    259     271     CHEMICAL        -       carboplatino
es-S2340-98942015000100005-1    274     284     CHEMICAL        -       paclitaxel

Old output file (from commit 11da870121cfa398babd5238d2ae698e6caedf11):

es-S2254-28842014000200009-1    1346    1355    SYMPTOM -       focalidad
es-S2254-28842014000200009-1    1382    1401    SYMPTOM -       síntoma neurológico
es-S2254-28842014000200009-1    2271    2279    SYMPTOM -       sangrado
es-S2254-28842014000200009-1    2740    2758    SYMPTOM -       mal estado general
es-S2254-28842014000200009-1    2777    2798    SYMPTOM -       constantes mantenidas
es-S2254-28842014000200009-1    2893    2900    SYMPTOM -       agitada
es-S2254-28842014000200009-1    2922    2933    SYMPTOM -       hipotensión
es-S2254-28842014000200009-1    2936    2948    SYMPTOM -       convulsiones
es-S2340-98942015000100005-1    259     271     CHEMICAL        -       carboplatino
es-S2340-98942015000100005-1    274     284     CHEMICAL        -       paclitaxel

richardjonker2000 commented 2 weeks ago

I believe the bug is related to document splitting, specifically data.py line 222: "text": doc['text'][low_offset: high_offset],

I changed the code so that each chunk's text field contains only the text within its respective offsets. I have not verified this, but when the results text is constructed, it takes the text field from the first chunk.

This corresponds to line 42 in inference.py: text = documents[doc][0]["text"], which takes only the first chunk as the text.
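
A minimal sketch of why this would produce empty text fields, assuming the entity text is extracted by slicing with document-level offsets. The document text, the split point, and the "doc-1" key are made up for illustration; only the documents[doc][0]["text"] lookup comes from inference.py.

```python
# Hypothetical document, split into two chunks that each keep only their own text.
full_text = "focalidad neurológica. " * 3 + "sangrado"
split_at = 40
documents = {
    "doc-1": [
        {"text": full_text[:split_at]},
        {"text": full_text[split_at:]},
    ]
}

# Same lookup as inference.py line 42: only the first chunk is used.
text = documents["doc-1"][0]["text"]

# Document-level offsets of an entity that lies in the second chunk.
start = full_text.index("sangrado")
end = start + len("sangrado")

# Slicing past the end of a string raises no error in Python; it returns "".
entity_text = text[start:end]
```

This would be consistent with the output above, where only the entities with the smallest offsets in a document keep their text.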

We can either fix the code in inference or in data.

I think it's better to fix it in inference.
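
One way the inference-side fix could look, as a sketch only. It assumes each chunk also records where its slice starts in the original document; the "low_offset" key and the rebuild_text helper are hypothetical names, not something confirmed to exist in the repository. Chunks are allowed to overlap.

```python
def rebuild_text(chunks):
    """Reassemble the document text from chunks that each hold only their own slice.

    Assumes every chunk is a dict with "text" and a hypothetical "low_offset"
    key giving where that slice starts in the original document.
    """
    text = ""
    for chunk in sorted(chunks, key=lambda c: c["low_offset"]):
        low = chunk["low_offset"]
        if low > len(text):
            raise ValueError(f"gap in chunk coverage before offset {low}")
        # Skip the part that overlaps what has already been reassembled.
        text += chunk["text"][len(text) - low:]
    return text


# Hypothetical overlapping chunks of "constantes mantenidas".
chunks = [
    {"low_offset": 0, "text": "constantes man"},
    {"low_offset": 8, "text": "es mantenidas"},
]
full_text = rebuild_text(chunks)
```

Under these assumptions, the lookup in inference.py could become something like text = rebuild_text(documents[doc]), so that entity offsets are applied to the whole document rather than to the first chunk.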