google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Custom Domain Training #498

Open ghost opened 5 years ago

ghost commented 5 years ago

I have seen some questions here that remain unanswered, so let me ask simply: can anyone suggest how to train BERT on our own domain data? If we append a few question-answer pairs (say 10) to the SQuAD train.json, will BERT start understanding our domain data and answering those same questions with higher confidence?

hsm207 commented 5 years ago

If the format of your own domain data is exactly the same as SQuAD, then you can just replace train.json with your own data, i.e. there is no need to append. If your format is different, you will need to write a DataProcessor class to get it into a format that BERT can understand.
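For reference, here is a minimal sketch of a single training example in the SQuAD v1.1 JSON format that train.json uses. The title, context, and QA text below are made-up placeholders; `answer_start` is the character offset of the answer span within the context:

```json
{
  "version": "1.1",
  "data": [
    {
      "title": "Example Domain Document",
      "paragraphs": [
        {
          "context": "Acme Widgets was founded in 1999 and is headquartered in Springfield.",
          "qas": [
            {
              "id": "acme-q1",
              "question": "When was Acme Widgets founded?",
              "answers": [
                {"text": "1999", "answer_start": 28}
              ]
            }
          ]
        }
      ]
    }
  ]
}
```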

ghost commented 5 years ago

We took the easier way: we formatted our content in the same JSON format, added 1 paragraph, 1 question, and 1 answer, and trained on it. We got a new checkpoint, but when we switched to predict mode, the answers did not change (or rather, did not improve).
That is why we are wondering what mistake we made. Is this not the right way to train? Is the data too small to train on? Is the vocab missing our new words? Or something else? Any suggestions?
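For reference, this is roughly the standard fine-tuning invocation from the repo's README, assuming `BERT_BASE_DIR` points at a downloaded pre-trained checkpoint and `SQUAD_DIR` at your JSON files (both are placeholders for your own paths):

```shell
python run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v1.1.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --train_batch_size=12 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=/tmp/squad_base/   # fine-tuned checkpoints are written here
```

One thing worth checking: if I read the Estimator setup correctly, prediction restores the latest checkpoint found in `--output_dir` rather than `--init_checkpoint`, so if your predict run pointed at a fresh or different `output_dir`, the fine-tuned weights would never be used.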

hsm207 commented 5 years ago

Have you tried other models for comparison? At the very least, BERT should be among the best-performing models at the moment.

ghost commented 5 years ago

Actually, it is not about comparative accuracy with other models on the market. It is about making BERT itself more knowledgeable on data it currently cannot answer questions about. I don't want to switch to another model after the time we have invested in BERT. I appreciate it and want others to as well, by giving it incremental training on the QnA it is not able to answer.
Our incremental training approach is failing, so we are looking for direction.