zheyuye opened 4 years ago
Hi! Using a smaller generator should work better; we used a larger generator for ELECTRA-Small++ (the released ELECTRA-Small model) by accident. This may have hurt its performance a bit, but I doubt by much, because the smaller generator mainly helps with efficiency and we trained ELECTRA-Small++ to convergence. What do you mean by "competitive results" when using a size-64 generator? It is not possible to run ELECTRA-Small-OWT on SQuAD because its max_seq_length is too small.
The different max sequence length shouldn't be an issue, because the position embedding tensor is always [512, embedding_size] regardless of config.max_seq_length; its size is instead defined by max_position_embeddings in the BertConfig (which I agree is a bit confusing).
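To make the point concrete, here is a toy sketch (not the repo's actual code; the variable and function names here are illustrative assumptions) of why the position-embedding table's shape does not depend on max_seq_length — the table is allocated at max_position_embeddings rows and the model just slices off the first seq_len rows:

```python
import numpy as np

# Hypothetical illustration: the position-embedding table is allocated once
# at [max_position_embeddings, embedding_size], independent of max_seq_length.
max_position_embeddings = 512   # fixed in the Bert-style config
embedding_size = 128            # ELECTRA-Small embedding size
position_table = np.zeros((max_position_embeddings, embedding_size))

def position_embeddings(seq_len):
    # Any seq_len <= 512 is served from the same [512, 128] table.
    return position_table[:seq_len]

print(position_embeddings(128).shape)  # (128, 128)
print(position_embeddings(512).shape)  # (512, 128)
```

This is why a checkpoint pre-trained with a short max_seq_length still carries a full [512, 128] position-embedding tensor.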
Yes, the quickstart ELECTRA-Small-OWT model mimics ELECTRA-Small in the paper (though trained on a different dataset), while the released ELECTRA-Small++ model has a longer sequence length and a larger generator. We released ELECTRA-Small++ rather than ELECTRA-Small because it is better on downstream tasks, but we plan to release the original ELECTRA-Small model in the future.
Thanks for answering. From what I understand, a smaller generator is always better by design, and uploading the mis-sized model was an accident?
That's right. See Figure 3 in our paper for some results with different generator sizes.
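For readers following along, the generator/discriminator size relationship discussed here can be sketched as a simple scaling rule (a minimal sketch; the function and key names below are assumptions for illustration, not the repo's exact API, though ELECTRA's configure_pretraining.py does expose the generator size as a fraction of the discriminator size):

```python
# Hypothetical sketch: derive generator dimensions as a fraction of the
# discriminator's, in the spirit of ELECTRA's generator_hidden_size fraction.
def scale_generator(disc_hidden, disc_ffn, disc_heads, frac):
    return {
        "hidden_size": int(frac * disc_hidden),
        "intermediate_size": int(frac * disc_ffn),
        # keep at least one attention head
        "num_attention_heads": max(1, int(frac * disc_heads)),
    }

# ELECTRA-Small discriminator: hidden 256, FFN 1024, 4 heads.
print(scale_generator(256, 1024, 4, 0.25))  # 1/4-size generator (hidden 64)
print(scale_generator(256, 1024, 4, 1.0))   # 1:1 generator (hidden 256)
```

With frac=0.25 this yields the size-64 generator discussed in this thread; frac=1.0 reproduces the 1:1 generator of the released model.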
@ZheyuYe Did you get the result of ELECTRA-Small-OWT by pretraining from scratch by yourself? What's the difference between ELECTRA-Small-OWT and your reproduction? Thanks :)
@amy-hyunji I re-pretrained the ELECTRA-Small model from scratch with the same training settings as ELECTRA-Small-OWT and fine-tuned it on the GLUE benchmark; only QQP and QNLI showed similar results, with the other seven datasets showing gaps of 0.4-1.5% compared with the published results.
I used the same hyper-parameters as the paper, but with a 1:1 generator size (hidden size 256) as you stated in #39, to pretrain an ELECTRA-Small model on the OpenWebText dataset. I then fine-tuned this pretrained model with EXACTLY the same hyper-parameters as the paper, resulting in the following outcomes.
There is still a huge gap between my reproduction and electra_small-owt except on RTE, and I am wondering whether you could share the SQuAD results for electra_small-owt to facilitate comparison.
In addition, I tried a 1:4 generator size (hidden size 64) and got competitive results. I am wondering why you chose to upload the 1:1 generator size as the officially released ELECTRA-Small model, which conflicts with both the paper and the experimental performance.
Another issue is about the max sequence length; I quote this from your arXiv paper,
which is supported by the code at https://github.com/google-research/electra/blob/79111328070e491b287c307906701ebc61091eb2/configure_pretraining.py#L79 but conflicts with the shape (512, 128) of electra/embeddings/position_embeddings in the released electra_small model. Does this mean that the open-source electra_small and the electra_small-owt in the QuickStart example differ not only in pre-training corpus, but also in generator size and max sequence length?