machinalis / iepy

Information Extraction in Python
BSD 3-Clause "New" or "Revised" License

Multi-token entities not recognised as such with IO annotation #130

Closed milesscherrer closed 7 years ago

milesscherrer commented 7 years ago

Hi,

I have plugged a custom-trained Stanford NER model into IEPY with my own class labels (not the typical PERSON and LOCATION tags, but very domain-specific ones). However, when labelling in IEPY, consecutive tokens tagged with the same class (IO annotation) are not recognised as one entity; each token becomes its own entity.

In the custom model, a PERSON such as "William Benjamin Euba" is recognised as one multi-token PERSON entity. How can you make this apply to custom entity classes?

jmansilla commented 7 years ago

We had to do some explicit "merging" after receiving the NER output. In our case, for the Stanford NER, I think it was done here: https://github.com/machinalis/iepy/blob/develop/iepy/preprocess/stanford_preprocess.py#L317
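
To illustrate what such a merging step might look like, here is a minimal sketch. It is not IEPY's actual code: the function name `merge_io_entities` and the `(token, label)` input format are assumptions made for the example. It collapses each run of consecutive tokens that share a non-`O` label into a single span:

```python
from itertools import groupby


def merge_io_entities(tagged_tokens, outside_label="O"):
    """Merge runs of consecutive same-label tokens into entity spans.

    tagged_tokens: list of (token, label) pairs, as an IO-style tagger
    might emit them (hypothetical format, not IEPY's internal one).
    Returns a list of (start, end, label) tuples, with `end` exclusive.
    """
    entities = []
    position = 0
    # groupby only groups adjacent items, which is exactly the IO rule:
    # neighbouring tokens with the same label belong to one entity.
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        length = len(list(run))
        if label != outside_label:
            entities.append((position, position + length, label))
        position += length
    return entities


tagged = [
    ("William", "PERSON"),
    ("Benjamin", "PERSON"),
    ("Euba", "PERSON"),
    ("took", "O"),
    ("aspirin", "DRUG"),  # "DRUG" stands in for a custom class label
]
spans = merge_io_entities(tagged)
```

With this input, `spans` holds one three-token PERSON span and one single-token DRUG span. Because IO annotation carries no begin marker, two distinct entities of the same class that sit directly next to each other would also be merged into one span.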

milesscherrer commented 7 years ago

After looking into the code some more, I now believe it is done in: https://github.com/machinalis/iepy/blob/develop/iepy/preprocess/ner/combiner.py

However, we do not see these methods being used anywhere in the code. Do you know if this is in fact where it is done and how/where to call it?