jinyichao closed this issue 8 years ago
Is it possible to separate the extractor from the model? I find that even a single line of training data can generate a model of over 330 MB, which is slightly larger than the extractor itself.
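For reference, my training flow is essentially MITIE's train_ner.py example. A minimal Python sketch (the paths and the toy sentence are placeholders):

```python
from mitie import ner_trainer, ner_training_instance

# The trainer loads the 330 MB total word feature extractor up front.
trainer = ner_trainer("MITIE-models/english/total_word_feature_extractor.dat")

# A single tokenized training sentence with one labeled entity.
sample = ner_training_instance(["My", "name", "is", "Davis", "King"])
sample.add_entity(range(3, 5), "person")  # tokens 3-4 form the entity
trainer.add(sample)

ner = trainer.train()

# The saved file embeds the word feature extractor, so even this
# one-sentence model comes out larger than 330 MB on disk.
ner.save_to_disk("my_ner_model.dat")
```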
You need the large model to make it work. Or are you asking about making multiple extractors that all share the same word model file so they aren't each 330MB?
Exactly! Is it possible to have multiple models, each no larger than 100 MB, that all share the same 330 MB extractor? In that case, if I want multiple models, the total size can be reduced significantly.
The current API assumes you have just one NER model, so it's not possible right now. The underlying tooling can certainly handle it though. I'll see about adding something to the API that supports this use case. Also, what language are you using to interact with MITIE?
I really appreciate your prompt reply! It would be great if this use case could be supported soon! Java is our first choice, but unfortunately there is no Java API for model training. It would be even better if you could provide one soon!
There is now a MITIE Java API thanks to this PR https://github.com/mit-nlp/MITIE/pull/32
That's so sweet! Thanks a lot!
In the meantime, I am still looking forward to the separation of the extractor and the models.
Hi Davis, it is so sweet that you have made another PR to make this use case available. My final question about this issue is: what exactly is inside the extractor, i.e. total_word_feature_extractor.dat? A clear understanding might help us reduce the file size further. Thank you!
In my understanding, is it something like the "vectors.bin" file used by word2vec (https://code.google.com/archive/p/word2vec/), which maps each token into a high-dimensional feature space?
PR #33 was added recently to help with the model file size issue. As for what's inside: yes, it's a variant of word2vec based on the two-step CCA method from this paper: http://icml.cc/2012/papers/763.pdf. I also upgraded it with something similar to the CCA method that works on out-of-sample words by analyzing their morphology to produce a word vector. This significantly improved the results on datasets containing lots of words not in the original dictionary.
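For illustration, using a separated model from Python should look roughly like the sketch below. This is a sketch rather than the final API: the `pure_model` flag on `save_to_disk` and the two-argument `named_entity_extractor` constructor follow the pattern in the Python wrapper and may differ in detail.

```python
from mitie import ner_trainer, ner_training_instance, named_entity_extractor

# Train a tiny model (see the training sketch earlier in the thread).
trainer = ner_trainer("total_word_feature_extractor.dat")
sample = ner_training_instance(["Davis", "wrote", "MITIE"])
sample.add_entity(range(0, 1), "person")
trainer.add(sample)
ner = trainer.train()

# Save only the learned NER weights (a "pure model"), without embedding
# the 330 MB word feature extractor in the output file.
ner.save_to_disk("small_ner_model.dat", pure_model=True)

# Load the small model together with the shared extractor. Many small
# models can point at the same total_word_feature_extractor.dat.
ner2 = named_entity_extractor("small_ner_model.dat",
                              "total_word_feature_extractor.dat")
```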
Thanks for your quick and clear explanation, which perfectly addresses my concern! I am just curious whether there is a way to customize such a data model. Is wordrep in the tools folder the right tool for that?
Yes, that part of the model is generated by the wordrep tool. You could run it and ask it to output a smaller dictionary if you want a smaller model. Note that you need access to a large text corpus, such as the Gigaword news dataset, to generate a high-quality model.
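For example, MITIE's docs show running `wordrep -e a_folder_containing_only_text_files` over your corpus to produce the .dat file. Once it finishes, you can sanity-check the result from Python, roughly as below (a sketch; the attribute names follow the Python wrapper's `total_word_feature_extractor` class and are an assumption here):

```python
from mitie import total_word_feature_extractor

# Load the extractor produced by wordrep and inspect its size: the
# dictionary size is the main knob controlling the file size on disk.
fe = total_word_feature_extractor("my_total_word_feature_extractor.dat")
print("words in dictionary:", fe.num_words_in_dictionary)
print("vector dimensions:  ", fe.num_dimensions)
print("vector for 'apple': ", fe.get_feature_vector("apple"))
```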
Hi, I've come across this library and found it really amazing! The accuracy is even better than the Stanford NER demo!
I understand it contains a high-dimensional space with over 500,000 dimensions, but is it possible to reduce the model size?