mit-nlp / MITIE

MITIE: library and tools for information extraction

How to bootstrap model with known entities? #62

Closed: theconnectionist closed this issue 8 years ago

theconnectionist commented 8 years ago

Hi Davis,

Thank you so much for this high performance open source library. I have one question that I couldn't find an answer to wrt training the entity recognizer.

I would like to take advantage of already known entities, but also be able to recognize entities not already in the dictionary. For example, the Wikidata project provides millions of entities, and it would be nice to seed the model with those known entities. A couple of approaches I can think of:

  1. Train a new model using whatever training data I can gather, and load the known entities into a dictionary. At runtime, for a given sentence, look up known entities in the dictionary and also run the sentence through the NER model. Then reconcile the two, with the dictionary-based recognition overriding any conflicting judgements. I wrote this, but I don't think it is a good idea.
  2. Generate a large set of training data by plugging in already known entities. E.g. knowing "Davis King" and "MIT" are entities, generate a training sentence "This library is from Davis King of MIT". I would think this approach's results will be heavily influenced by the variation of the filler text generated as part of the training set.
  3. How would you go about doing this? Is there a straightforward technique to seed the model with known entities, or a recommended technique to supplement the model with a dictionary?

davisking commented 8 years ago

The only model creation methods in MITIE are the ones documented in the example programs. So you would need to create a new dataset that contained the union of all the entities you wanted to deal with and train on that.

You also seem to be asking if MITIE supports user generated features like "is in my dictionary". There is no API for that, since I found gazetteers to not make much of a difference in accuracy and they complicate the user workflow. That said, the C++ code for running MITIE isn't that complicated, so you could add your own additional features by editing it if you wanted to. At the end of the day, MITIE is just a simple application of this dlib tool, which is fully documented, so it's easy to modify.

But I wouldn't worry about that. The thing to do is make a single unified training dataset that captures what you want to do and train a model based on that dataset.
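
For anyone landing here later, one way to produce part of such a dataset from a list of known entities (approach 2 in the question) might look like the sketch below. It is only an illustration, not something from MITIE itself: the function name `make_training_samples`, the `{label}` slot syntax, and the templates are all made up for this example. It fills slots in whitespace-tokenized templates and records each entity as a `(start, end, label)` token span with an exclusive end. As the question already points out, the result will depend heavily on how varied the templates are, so generated sentences are probably best mixed with real annotated text.

```python
import random


def make_training_samples(templates, entities, n_samples, seed=0):
    """Fill {label} slots in templates with known entity names.

    templates: whitespace-tokenized strings such as
               "This library is from {person} of {organization}"
    entities:  dict mapping a label to a list of known names
    Returns a list of (tokens, spans) pairs, where each span is
    (start, end, label) over token indices, with end exclusive.
    """
    rng = random.Random(seed)  # seeded so the generated dataset is reproducible
    samples = []
    for _ in range(n_samples):
        template = rng.choice(templates)
        tokens, spans = [], []
        for word in template.split():
            if word.startswith("{") and word.endswith("}"):
                label = word[1:-1]
                # An entity name may span several tokens, e.g. "Davis King".
                name_tokens = rng.choice(entities[label]).split()
                spans.append((len(tokens), len(tokens) + len(name_tokens), label))
                tokens.extend(name_tokens)
            else:
                tokens.append(word)
        samples.append((tokens, spans))
    return samples


templates = ["This library is from {person} of {organization}"]
entities = {"person": ["Davis King"], "organization": ["MIT"]}
samples = make_training_samples(templates, entities, n_samples=1)
tokens, spans = samples[0]

# Each (tokens, spans) pair could then be handed to MITIE's trainer. Going by
# the train_ner.py example shipped with MITIE, that would be roughly:
#   sample = ner_training_instance(tokens)
#   sample.add_entity(range(start, end), label)
#   trainer.add(sample)
# Treat those calls as an assumption and check them against the example program.
```

The spans use an exclusive end index so that they map directly onto a Python `range(start, end)`.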