utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

Pretraining not generating added tokens file #213

Open saishashank85 opened 4 years ago

saishashank85 commented 4 years ago

Hi, 1. I am trying to further pre-train the pretrained BERT model on my own corpus using the LM learner. After training, the saved model does not contain the added tokens file, and the vocab size stays at 30522, i.e. BERT's default vocabulary size. I've also taken a look at the lm_train and lm_test files and couldn't make out the format used. (A sketch of the workflow in question follows this list.)

2. There is also a test % split parameter for the LM learner. What does the % refer to? From my understanding it should refer to the % of masked tokens from the corpus (as in the original Google BERT script), or am I missing something here? (This parameter also appears in the sketch after this list.)

3. Would it also be possible to add support for whole-word masking and skip-thought training in future versions?
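For context, here is a minimal sketch of the LM fine-tuning flow being discussed, based on the usage pattern shown in the fast-bert README around that time. Argument names such as `test_size` and the exact `fit()` signature are assumptions about the installed version and may need adjusting against `data_lm.py` / `learner_lm.py` in your release.

```python
import logging
from pathlib import Path

import torch
from fast_bert.data_lm import BertLMDataBunch
from fast_bert.learner_lm import BertLMLearner

logger = logging.getLogger()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# texts is a plain list of raw strings from the custom corpus; fast-bert
# writes the lm_train / lm_val files into data_dir from this list.
texts = ["first document ...", "second document ..."]

databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=Path("data/"),
    text_list=texts,
    tokenizer="bert-base-uncased",
    batch_size_per_gpu=16,
    max_seq_length=256,
    multi_gpu=False,
    test_size=0.1,      # assumed name: fraction of the corpus held out for validation (item 2)
    model_type="bert",
    logger=logger,
)

learner = BertLMLearner.from_pretrained_model(
    databunch_lm,       # passed positionally; keyword name varies across versions
    pretrained_path="bert-base-uncased",
    output_dir=Path("model_lm/"),
    metrics=[],
    device=device,
    logger=logger,
    multi_gpu=False,
    logging_steps=50,
)

learner.fit(epochs=1, lr=1e-4, validate=True)
learner.save_model()    # saves the further pre-trained model and tokenizer files;
                        # the vocab stays at 30522 unless tokens are added explicitly
```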
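On the missing added tokens file specifically: further masked-LM pre-training by itself does not extend a WordPiece vocabulary, so nothing new gets written to added_tokens.json. One possible workaround, sketched below with plain Hugging Face transformers calls rather than a fast-bert API, is to add domain-specific tokens explicitly and resize the embedding matrix before running the LM fine-tuning; the token list and output path here are hypothetical.

```python
import os

from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical domain-specific terms missing from the default 30522-entry vocab.
new_tokens = ["mycorpusterm1", "mycorpusterm2"]
num_added = tokenizer.add_tokens(new_tokens)        # returns how many tokens were actually new
model.resize_token_embeddings(len(tokenizer))       # grow the embedding matrix to match

os.makedirs("model_lm", exist_ok=True)
tokenizer.save_pretrained("model_lm")               # writes vocab.txt plus added_tokens.json
model.save_pretrained("model_lm")
```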

Thanks in advance!

aaronbriel commented 4 years ago

It might be best to submit issues one at a time and label them appropriately as bug/enhancement/help wanted, etc.