mit-nlp / MITIE

MITIE: library and tools for information extraction

Is it possible to reduce the size of the model #31

Closed · jinyichao closed this issue 8 years ago

jinyichao commented 8 years ago

Hi, I've come across this library and found it really amazing! The accuracy is even better than the Stanford NER demo!

Although I understand it works with a high-dimensional feature space of over 500,000 dimensions, is it possible to reduce the model size?

jinyichao commented 8 years ago

Is it possible to separate the extractor from the model? I find that even one line of training data can produce a model of over 330 MB, which is slightly larger than the extractor itself.

davisking commented 8 years ago

You need the large model to make it work. Or are you asking about making multiple extractors that all share the same word model file so they aren't each 330 MB?

jinyichao commented 8 years ago

Exactly! Is it possible to have multiple models, each no more than 100 MB, that all share the same 330 MB extractor? In that case, if I want multiple models, the total size can be significantly reduced.

davisking commented 8 years ago

The current API assumes you have just one NER model, so it's not possible right now. The underlying tooling can certainly handle it though. I'll see about adding something to the API that supports this use case. Also, what language are you using to interact with MITIE?
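
For context, here is a minimal sketch of the single-model training flow, adapted from the train_ner.py example in the MITIE repo (the file paths are placeholders). Saving the trained extractor bundles in everything it needs, including the shared word feature data, which is why each saved model is over 330 MB:

```python
from mitie import ner_training_instance, ner_trainer

# One tiny training sample: a tokenized sentence plus an entity annotation.
sample = ner_training_instance(["My", "name", "is", "Davis", "King"])
sample.add_entity(range(3, 5), "person")  # tokens 3-4 form the entity

# The trainer is seeded with the large shared word feature model.
trainer = ner_trainer("total_word_feature_extractor.dat")
trainer.add(sample)
trainer.num_threads = 4

# train() returns a named_entity_extractor; saving it writes the word
# feature data into the output file alongside the learned NER model.
ner = trainer.train()
ner.save_to_disk("new_ner_model.dat")
```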

jinyichao commented 8 years ago

I really appreciate your prompt reply! It would be great if this use case could be supported soon! Java is our first choice, but unfortunately there is no Java API for model training. It would be even better if you could provide such a Java API soon!

davisking commented 8 years ago

There is now a MITIE Java API, thanks to this PR: https://github.com/mit-nlp/MITIE/pull/32

jinyichao commented 8 years ago

That's so sweet! Thanks a lot!

At the same time, I also look forward to the separation of extractor and models.

jinyichao commented 8 years ago

Hi Davis, it is so sweet that you have made another PR to make this use case available. My final question about this issue is: what exactly is inside the extractor, say, total_word_feature_extractor.dat? A clear understanding may help us further reduce the file size. Thank you!

jinyichao commented 8 years ago

In my understanding, is it something like the file used in word2vec, say, "vectors.bin" from https://code.google.com/archive/p/word2vec/, which maps each word into a high-dimensional vector space?

davisking commented 8 years ago

PR #33 was added recently to help with the model file size issue. As for what's inside: yes, it's a variant of word2vec based on the two-step CCA method from this paper: http://icml.cc/2012/papers/763.pdf. I also upgraded it with something similar to the CCA method that works on out-of-sample words by analyzing their morphology to produce a word vector. This significantly improved the results on datasets containing lots of words not in the original dictionary.
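
As a sketch of what that file holds, MITIE's Python wrapper (mitie.py) exposes a total_word_feature_extractor class for loading it directly; the word below is an arbitrary example, and exact return types may vary slightly between versions:

```python
from mitie import total_word_feature_extractor

fe = total_word_feature_extractor("total_word_feature_extractor.dat")
print(fe.num_dimensions)           # length of each word vector
print(fe.num_words_in_dictionary)  # how many words have stored vectors

# In-dictionary words are looked up directly; out-of-dictionary words
# still get a vector, derived from their morphology as described above.
vec = fe.get_feature_vector("prosecutor")
print(len(vec))  # equals fe.num_dimensions
```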

jinyichao commented 8 years ago

Thanks for your quick and clear explanation, which perfectly addresses my concern! I am just curious: is there a way to customize such a data model? Is wordrep in the tools folder the right tool?

davisking commented 8 years ago

Yes, that part of the model is generated by the wordrep tool. You could run it and ask it to output a smaller dictionary if you want a smaller model. Note that you need access to a large text corpus, such as the Gigaword news dataset, to generate a high-quality model.
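
For reference, the MITIE README shows the basic wordrep invocation (the folder name is a placeholder for your own corpus of plain text files):

```
wordrep -e a_folder_containing_only_text_files
```

This produces a new total_word_feature_extractor.dat from that corpus; as noted above, requesting a smaller dictionary yields a smaller file.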