Hi, thanks for releasing the ELECTRA models. I recently used my own data to continue pretraining a base-size Chinese ELECTRA model (this one: https://github.com/ymcui/Chinese-ELECTRA). Since it is a Chinese model, it uses a different vocab.txt (the same 21,128-line vocab as Chinese BERT-base). What I did was: first, I ran build_pretraining_dataset.py with the 21,128-token vocab to generate the tfrecords; second, I added init_checkpoint support following https://github.com/google-research/electra/pull/74; third, I continued pretraining my own base-size Chinese ELECTRA model from the Chinese-ELECTRA checkpoint, using learning rate 2e-4, 1,000,000 training steps, and the base model size. The command line was roughly: python3 run_pretraining.py --data-dir pretrain_chinese_model/ --model-name my_model --init_checkpoint pretrain_chinese_model/models/Chinese-Electra
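For clarity, here is approximately what I ran. The paths are placeholders for my local setup, the `--init_checkpoint` flag comes from my local change following PR #74 (so its exact name may differ in your copy), and I passed the remaining hyperparameters through `--hparams`, roughly like this:

```bash
# Step 1: build tfrecords with the 21,128-line Chinese vocab
# (paths below are placeholders for my local setup)
python3 build_pretraining_dataset.py \
  --corpus-dir pretrain_chinese_model/corpus \
  --vocab-file pretrain_chinese_model/vocab.txt \
  --output-dir pretrain_chinese_model/pretrain_tfrecords \
  --max-seq-length 512 \
  --num-processes 4

# Step 2: continue pretraining from the Chinese-ELECTRA checkpoint
# (--init_checkpoint is my local addition based on PR #74)
python3 run_pretraining.py \
  --data-dir pretrain_chinese_model/ \
  --model-name my_model \
  --init_checkpoint pretrain_chinese_model/models/Chinese-Electra \
  --hparams '{"model_size": "base", "vocab_size": 21128, "learning_rate": 2e-4, "num_train_steps": 1000000}'
```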
The loss after 100,000 steps was around 3.4. However, when I fine-tuned a classification model from the 100,000-step checkpoint, its performance was much worse than the original Chinese-ELECTRA. I am wondering why even 100,000 steps of continued pretraining from Chinese-ELECTRA gives such poor performance. Did I make any mistakes?