google-research / electra

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Difference in reproduced results on electra_small_owt #73

Open zheyuye opened 4 years ago

zheyuye commented 4 years ago

I pretrained an ELECTRA-Small model on the OpenWebText dataset with the same hyper-parameters as the paper, except for a 1:1 generator size with a hidden size of 256, as you described in #39. I then fine-tuned this pretrained model with the EXACT same hyper-parameters as the paper, resulting in the following outcomes:

| | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE | SQuAD 1.1 | SQuAD 2.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Metrics | MCC | Acc | Acc | Spearman | Acc | Acc | Acc | Acc | EM/F1 | EM/F1 |
| ELECTRA-Small | 57.0 | 91.2 | 88.0 | 87.5 | 89.0 | 81.3 | 88.4 | 66.7 | 75.8/-- | 70.1/-- |
| ELECTRA-Small-OWT | 56.8 | 88.3 | 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5 | -- | -- |
| My reproduction | 51.04 | 85.21 | 83.58 | 84.79 | 87.16 | 75.01 | 84.79 | 66.06 | 60.97/70.13 | 59.83/62.68 |

There is still a huge gap between my reproduction and ELECTRA-Small-OWT on every task except RTE. Could you please share the SQuAD results for ELECTRA-Small-OWT to facilitate a comparison?

In addition, I tried a 1:4 generator size with a hidden size of 64 and got competitive results. I am wondering why you chose to release the 1:1 generator size as the official ELECTRA-Small model, which conflicts with both the paper and the experimental performance.
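For reference, a minimal sketch of how such a generator ratio can be passed to `run_pretraining.py` through its `--hparams` JSON; `generator_hidden_size` is defined in `configure_pretraining.py` as a fraction of the discriminator hidden size, and the exact values below are my assumptions for the two setups discussed here, not an official recipe:

```python
import json

# Hparams overrides for run_pretraining.py (--hparams takes a JSON string).
# generator_hidden_size is a fraction of the discriminator hidden size; the
# values below are assumptions matching the two setups discussed above.
paper_small = {
    "model_size": "small",
    "generator_hidden_size": 0.25,  # 1:4 generator (hidden size 64 for a 256-d discriminator)
}
released_small = {
    "model_size": "small",
    "generator_hidden_size": 1.0,   # 1:1 generator (hidden size 256), as in the released model
}

print(json.dumps(paper_small))      # pass as: --hparams '<this JSON>'
print(json.dumps(released_small))
```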

Another issue is about the max sequence length. I quote this from your arXiv paper:

we shortened the sequence length (from 512 to 128)

which is supported by the code at https://github.com/google-research/electra/blob/79111328070e491b287c307906701ebc61091eb2/configure_pretraining.py#L79 but conflicts with the (512, 128) shape of electra/embeddings/position_embeddings in the released electra_small model.
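For anyone who wants to check this themselves, a minimal sketch that lists the position-embedding shape in the released checkpoint; the directory name is an assumption about where the download was unpacked, not part of the official release layout:

```python
import tensorflow as tf

# Path to the unpacked released checkpoint; adjust to your own download location.
CKPT_DIR = "electra_small"

# tf.train.list_variables returns (name, shape) pairs for every variable in the
# checkpoint, which is how the [512, 128] position-embedding shape can be seen.
for name, shape in tf.train.list_variables(CKPT_DIR):
    if "position_embeddings" in name:
        print(name, shape)  # e.g. electra/embeddings/position_embeddings [512, 128]
```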

Does this mean that the open-source electra_small and the ELECTRA-Small-OWT model in the QuickStart example differ not only in the pre-training corpus, but also in the generator size and the max sequence length?

clarkkev commented 4 years ago

Hi! Using a smaller generator should work better; we used a larger generator for ELECTRA-Small++ (the released ELECTRA-Small model) by accident. This may have hurt its performance a bit, but I doubt by much, because the smaller generator mainly helps with efficiency and we trained ELECTRA-Small++ to convergence. What do you mean by "competitive results" when using a size-64 generator? It is not possible to run ELECTRA-Small-OWT on SQuAD because its max_seq_length is too small.

The different max sequence length shouldn't be an issue because the position embedding tensor is always [512, embedding_size] regardless of config.max_sequence_length; its size is instead determined by max_position_embeddings in the BertConfig (which I agree is a bit confusing).
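To illustrate why the shapes are compatible, here is a minimal sketch in plain NumPy (not the repo's actual embedding code): the checkpoint stores the full table sized by max_position_embeddings, but a shorter input only ever reads the first max_seq_length rows.

```python
import numpy as np

max_position_embeddings = 512  # size of the stored table (from the BertConfig)
embedding_size = 128           # embedding width of the small model
max_seq_length = 128           # sequence length actually used in pre-training

# The checkpoint stores the full [512, 128] position table...
position_table = np.zeros((max_position_embeddings, embedding_size))

# ...but a 128-token batch only ever looks up the first 128 rows.
used = position_table[:max_seq_length]
print(used.shape)  # (128, 128)
```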

Yes, the quickstart ELECTRA-Small-OWT model mimics ELECTRA-Small from the paper (but with a different dataset), while the released ELECTRA-Small++ model has a longer sequence length and a larger generator. We released ELECTRA-Small++ rather than ELECTRA-Small because it is better on downstream tasks, but we plan to release the original ELECTRA-Small model in the future.

zheyuye commented 4 years ago

Thanks for answering. From what I understand, a smaller generator is always better by design, but using and uploading a mis-sized model was an accident?

clarkkev commented 4 years ago

That's right. See Figure 3 in our paper for some results with different generator sizes.

amy-hyunji commented 3 years ago

@ZheyuYe Did you get the ELECTRA-Small-OWT result by pretraining from scratch yourself? What's the difference between ELECTRA-Small-OWT and your reproduction? Thanks :)

zheyuye commented 3 years ago

@amy-hyunji I re-pretrained the ELECTRA-Small model from scratch with the same training settings as ELECTRA-Small-OWT and fine-tuned it on the GLUE benchmark. Only QQP and QNLI showed results similar to the published numbers; the other seven datasets had gaps of 0.4-1.5% compared with the published results.
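For context, GLUE fine-tuning of this kind is launched through `run_finetuning.py`; a minimal sketch of building the hparams JSON it expects, following the README's quickstart pattern (the model name and task list below are my assumptions for the setup described above, not the exact command used):

```python
import json

# Hparams JSON for run_finetuning.py (--hparams); model name and task list
# are assumptions matching the reproduction described above.
hparams = {
    "model_size": "small",
    "task_names": ["cola", "sst", "mrpc", "sts", "qqp", "mnli", "qnli", "rte"],
}
print(json.dumps(hparams))
# e.g. python3 run_finetuning.py --data-dir $DATA_DIR \
#        --model-name electra_small_owt --hparams '<this JSON>'
```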