ghost opened this issue 5 years ago
If the format of your own domain data is exactly the same as SQuAD's, then you can just replace train.json with your own data, i.e. there is no need to append. If your format is different, you will need to create a DataProcessor class to get it into a format that BERT can understand. A sketch of the expected file layout is below.
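For concreteness, here is a minimal sketch of the SQuAD v1.1 file layout that run_squad.py reads. All of the content below (title, context, question, id) is placeholder text, not real data:

```python
import json

# Placeholder content for illustration only.
context = "Acme Corp was founded in 1999 in Springfield."
answer_text = "1999"

train_data = {
    "version": "1.1",
    "data": [{
        "title": "My Domain Document",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "my-domain-0001",
                "question": "When was Acme Corp founded?",
                "answers": [{
                    "text": answer_text,
                    # answer_start must be the character offset of the
                    # answer text inside the context string.
                    "answer_start": context.index(answer_text),
                }],
            }],
        }],
    }],
}

with open("train.json", "w") as f:
    json.dump(train_data, f, indent=2)
```

Computing answer_start with context.index(...) keeps the character offset consistent with the context string, which is the part of the format that is easiest to get wrong by hand.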
We took the easier way: we formatted our content in the same JSON format, added one paragraph, one question, and one answer, and trained on it. We got a new checkpoint, but when we switched to predict mode, the answers did not change (or rather, did not improve).
That is why we are wondering what mistake we made. Is this not the right way to train? Is the data too small to train on? Is the vocab missing our new words? Or is it something else? Any suggestions?
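One mistake worth ruling out first: if answer_start does not line up with the answer text inside the context, the example is effectively useless for training. A quick sanity check along these lines may help (my_train.json is a placeholder for your own file):

```python
import json

# "my_train.json" is a placeholder for your SQuAD-format training file.
with open("my_train.json") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                start = answer["answer_start"]
                text = answer["text"]
                # The answer text must appear in the context exactly
                # at the recorded character offset.
                if context[start:start + len(text)] != text:
                    print(f"Misaligned answer in {qa['id']}: "
                          f"expected {text!r} at offset {start}")
```

Beyond that, a single question-answer pair is a very weak training signal for a model of BERT's size, so unchanged predictions after one example are not surprising on their own.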
Have you tried other models for comparison? At the very least, BERT should be giving better performance than they do at the moment.
Actually, it is not about comparative accuracy with other models on the market. It is about making BERT itself more knowledgeable with data it currently cannot answer questions about. I don't want to switch to another model after the time I have invested in BERT. I appreciate it and want others to as well, by giving it incremental training on the QnA it is not able to answer.
Our incremental training approach is failing, so we are looking for a direction.
I have seen some questions here that are unanswered, so I am simply asking: can anyone suggest how to train BERT on our own domain data? If we append a few question-answer pairs (say 10) to the SQuAD train.json, will BERT start understanding our domain data and answer those same questions with better confidence?
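For the append approach specifically, here is a minimal sketch of merging your own examples into the SQuAD training set, assuming your extra question-answer pairs are already in SQuAD format in a separate file (both file names below are placeholders):

```python
import json

# File names are placeholders: train-v1.1.json is the original SQuAD
# training set, my_domain.json holds your own SQuAD-format examples.
with open("train-v1.1.json") as f:
    squad = json.load(f)
with open("my_domain.json") as f:
    domain = json.load(f)

# Each file's "data" field is a list of articles, so extending the
# SQuAD list with the domain articles merges the two training sets.
squad["data"].extend(domain["data"])

with open("train_combined.json", "w") as f:
    json.dump(squad, f)
```

run_squad.py can then be pointed at train_combined.json via its --train_file flag, so the model sees both the original SQuAD examples and your domain examples during fine-tuning.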