sogou / SogouMRCToolkit

This toolkit was designed for the fast and efficient development of modern machine comprehension models, including both published models and original prototypes.
Apache License 2.0
746 stars 164 forks source link

improvement to 'load' and 'save' method #9

Closed threefoldo closed 5 years ago

threefoldo commented 5 years ago

The 'load' and 'save' methods in BaseModel only deal with the model, other related data like 'vocab' is not processed. So, to load a model from file, the vocabulary needs to be rebuilt or loaded separately, which is not ideal. It's better to save and load all related stuff.

Class vocabulary has 'load' and 'save' methods, but all data are save in JSON. It's larger and easy to corrupt. I don't see the benefit of using JSON format here.

SunnyMarkLiu commented 5 years ago

In my opinion, the model and vocabulary should be saved separately. Consider the following example, we trained many models inherited from BaseModel and they shared the same vocabulary, obviously, saving vocabulary from the model can save many space. And in other words, saving separately can be favorable for later maintenance.

threefoldo commented 5 years ago

I agree that BaseModel should only consider the model, no data. However, high level class like BiDAFPlusSQuad and BertCoQA cannot work without data, and their data is different, BiDAFPlusSQuad needs pretrained embedding, BertCoQA doesn't. What I suggest here is a better way to let users know what data are necessary to save and rebuild, the example code has little information about that.

yylun commented 5 years ago
  1. Saving vocab as json file is for the purpose of better debug experience, and it does help us in experiments. And for smaller file size, simply saving it as binary is ok .
  2. Vocab is usually corpus related. Bert is a special case, since it is pre-trained on different texts.
  3. I see that the main problem is we did not provide examples about model saving and rebuilding, will work on it. Thanks for helpful discussion 😁
threefoldo commented 5 years ago

Thanks, @yukyang, that should solve all my questions.

A minor error in the new "model_save_load.md", train_batch_generator and eval_batch_generator are unnecessary during inference, it should only use test_batch_generator, which is not declared.