Hi,
1.
I am trying to further pre-train the pretrained BERT on my own corpus using the learner LM model.
However, after I'm done training, the saved model does not contain the added-tokens file, and the vocab size remains at 30522, i.e. the default BERT vocab size.
I've also taken a look at the lm_train and lm_test files and couldn't make out the format used.
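For reference, this is roughly what I expected to happen after adding tokens (a minimal sketch only, assuming the Hugging Face transformers tokenizer/model are used under the hood; the token strings and the output directory name are just placeholders):

```python
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical domain-specific tokens I would like to see added to the vocab
num_added = tokenizer.add_tokens(["mydomainterm1", "mydomainterm2"])
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix beyond 30522

tokenizer.save_pretrained("output_dir")  # should also write an added_tokens.json
model.save_pretrained("output_dir")
print(len(tokenizer))  # expected: 30522 + num_added
```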
2.
There is also a test % split parameter for the learner_lm model. What does the % refer to?
From my understanding, it should refer to the % of masked tokens from the corpus (as in the original Google BERT script), or am I missing something here?
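To make the two readings I have in mind concrete (both snippets below are only illustrations with hypothetical values, not the library's actual code):

```python
import random
from sklearn.model_selection import train_test_split

corpus = ["sentence one", "sentence two", "sentence three", "sentence four"]

# Reading (a): a held-out validation split, e.g. 10% of the documents
# go into lm_test and the rest into lm_train.
lm_train, lm_test = train_test_split(corpus, test_size=0.10)

# Reading (b): a masking fraction, as in Google BERT's
# create_pretraining_data.py (--masked_lm_prob=0.15), where roughly 15%
# of the tokens in each sequence are replaced by [MASK].
tokens = "this is an example sentence".split()
masked = [t if random.random() > 0.15 else "[MASK]" for t in tokens]
```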
3.
Is it also possible to add support for whole-word masking and skip-thoughts training in future versions?
Thanks in advance!