utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

Pretraining not generating added tokens file #213

Open saishashank85 opened 4 years ago

saishashank85 commented 4 years ago

Hi, 1. I am trying to further pre-train the pretrained BERT model on my own corpus using the LM learner. After training, the saved model does not contain the added tokens file, and the vocab size stays at 30522, i.e. BERT's default vocabulary size. I've also taken a look at the lm_train and lm_test files and couldn't make out the format used. (A sketch of the workflow in question follows this list.)

2. There is also a test % split parameter for the LM learner. What does the % refer to? From my understanding it should refer to the % of masked tokens from the corpus (as in the original Google BERT script), or am I missing something here? (This parameter also appears in the sketch after this list.)

3. Would it also be possible to add support for whole-word masking and skip-thought training in future versions?
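For context, here is a minimal sketch of the LM fine-tuning flow being discussed, based on the usage pattern shown in the fast-bert README around that time. Argument names such as `test_size` and the exact `fit()` signature are assumptions about the installed version and may need adjusting against `data_lm.py` / `learner_lm.py` in your release.

```python
import logging
from pathlib import Path

import torch
from fast_bert.data_lm import BertLMDataBunch
from fast_bert.learner_lm import BertLMLearner

logger = logging.getLogger()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# texts is a plain list of raw strings from the custom corpus; fast-bert
# writes the lm_train / lm_val files into data_dir from this list.
texts = ["first document ...", "second document ..."]

databunch_lm = BertLMDataBunch.from_raw_corpus(
    data_dir=Path("data/"),
    text_list=texts,
    tokenizer="bert-base-uncased",
    batch_size_per_gpu=16,
    max_seq_length=256,
    multi_gpu=False,
    test_size=0.1,      # assumed name: fraction of the corpus held out for validation (item 2)
    model_type="bert",
    logger=logger,
)

learner = BertLMLearner.from_pretrained_model(
    databunch_lm,       # passed positionally; keyword name varies across versions
    pretrained_path="bert-base-uncased",
    output_dir=Path("model_lm/"),
    metrics=[],
    device=device,
    logger=logger,
    multi_gpu=False,
    logging_steps=50,
)

learner.fit(epochs=1, lr=1e-4, validate=True)
learner.save_model()    # saves the further pre-trained model and tokenizer files;
                        # the vocab stays at 30522 unless tokens are added explicitly
```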
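On the missing added tokens file specifically: further masked-LM pre-training by itself does not extend a WordPiece vocabulary, so nothing new gets written to added_tokens.json. One possible workaround, sketched below with plain Hugging Face transformers calls rather than a fast-bert API, is to add domain-specific tokens explicitly and resize the embedding matrix before running the LM fine-tuning; the token list and output path here are hypothetical.

```python
import os

from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical domain-specific terms missing from the default 30522-entry vocab.
new_tokens = ["mycorpusterm1", "mycorpusterm2"]
num_added = tokenizer.add_tokens(new_tokens)        # returns how many tokens were actually new
model.resize_token_embeddings(len(tokenizer))       # grow the embedding matrix to match

os.makedirs("model_lm", exist_ok=True)
tokenizer.save_pretrained("model_lm")               # writes vocab.txt plus added_tokens.json
model.save_pretrained("model_lm")
```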

Thanks in advance!

aaronbriel commented 4 years ago

It might be best to submit issues one at a time and label them appropriately as bug/enhancement/help wanted, etc.