mit-nlp / MITIE

MITIE: library and tools for information extraction

Is MITIE a proper choice for restoring punctuation #218

Closed alexmro closed 8 months ago

alexmro commented 9 months ago

I looked inside some Python packages that train a BERT model to identify the words that need certain punctuation marks before them. They use labels like '.O', '!O', ',O', ':O', ';O' for that. I suppose you know what I mean.

I wonder if MITIE models can be trained the same way: that is, whether custom labels like these can be used as entity types, and whether they can give promising results when the extracted entities are later used to restore the punctuation. This assumes, of course, that the training material is well prepared and optimized for the trainer.
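To make the labelling idea concrete, here is a minimal Python sketch of how punctuated text could be turned into per-word labels and back. It is an illustration only: the function names `text_to_labels` and `restore` are made up for this example, and it assumes a simplified scheme where each word carries the mark that follows it and "O" means no mark. The packages mentioned above may encode their labels differently.

```python
import re

# Assumed label set for this sketch: the punctuation marks we try to restore.
PUNCT = {".", ",", "!", "?", ":", ";"}


def text_to_labels(text):
    """Turn punctuated text into (lowercased word, label) pairs.

    The label is the punctuation mark that follows the word,
    or "O" when no mark follows it (an assumed convention).
    """
    pairs = []
    # Tokens are either words (letters, digits, apostrophes) or one of the marks.
    for token in re.findall(r"[\w']+|[.,!?:;]", text):
        if token in PUNCT:
            # Attach the mark to the previous word, if there is one.
            if pairs:
                pairs[-1] = (pairs[-1][0], token)
        else:
            pairs.append((token.lower(), "O"))
    return pairs


def restore(pairs):
    """Rebuild a punctuated string from (word, label) pairs."""
    return " ".join(w if label == "O" else w + label for w, label in pairs)


if __name__ == "__main__":
    pairs = text_to_labels("Hello, world! How are you")
    print(pairs)
    print(restore(pairs))
```

The words with their labels stripped would be the model input, and the labels would be what a trained model has to predict. Capitalization is lost in this sketch and would need its own handling.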

davisking commented 9 months ago

You could probably make something like MITIE that used the same features to do that in an OK way. But MITIE itself is set up to pick out specific sequences of words, which isn't quite what you want, since you want to identify locations between words instead.

It might be OK to try to force MITIE to do it anyway. IDK. But I would either get an open source LLM or train a little binary SVM with a window of MITIE's word features. An LLM would be way better, but way more computationally expensive. It depends on the kinds of trade-offs you want to make.
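For anyone wondering what "a window of word features" could look like, here is a rough Python sketch that builds one feature vector per gap between words, ready to hand to any binary classifier. Everything in it is an assumption for illustration: `word_features` is a hypothetical hash-based stand-in and not MITIE's feature extractor, and `gap_window`, `make_examples`, `DIM` and the window size are made-up names and values.

```python
import hashlib

DIM = 8  # length of the stand-in word vector (arbitrary choice for this sketch)


def word_features(word):
    """Hypothetical stand-in for a real word feature extractor.

    Hashes the word into DIM values in [0, 1]. A real system would
    use learned word features here instead.
    """
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:DIM]]


def gap_window(words, gap, size=2):
    """Features for the gap right after words[gap].

    Concatenates the vectors of `size` words on each side of the gap.
    Positions outside the sentence are zero-padded.
    """
    feats = []
    for i in range(gap - size + 1, gap + size + 1):
        if 0 <= i < len(words):
            feats.extend(word_features(words[i]))
        else:
            feats.extend([0.0] * DIM)
    return feats


def make_examples(pairs, size=2):
    """Build (features, target) training examples from (word, label) pairs.

    target is 1 when a punctuation mark follows the word (label != "O"),
    else 0, so each example is a yes/no decision about one gap.
    """
    words = [w for w, _ in pairs]
    return [
        (gap_window(words, i, size), 1 if label != "O" else 0)
        for i, (_, label) in enumerate(pairs)
    ]


if __name__ == "__main__":
    data = [("hello", ","), ("world", "O"), ("again", ".")]
    for feats, target in make_examples(data):
        print(len(feats), target)
```

A binary classifier trained on these examples only answers "is there a mark in this gap?". Choosing which mark would need one classifier per mark or a multiclass model.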

alexmro commented 8 months ago

I was looking for something that can be done locally, in PHP, using this repo as a reference. I will give it a try and see how far I can go; most LLM-based solutions for PHP are OpenAI wrappers anyway. But thanks for broadening my horizons.