richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated!)

Unofficial PyTorch implementation of

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning

※ For updates and more work in the future, follow Twitter

Replicated Results

I pretrain ELECTRA-small from scratch and have successfully replicated the paper's results on GLUE.

| Model | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE | Avg. of Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ELECTRA-Small-OWT | 56.8 | 88.3 | 87.4 | 86.8 | 88.3 | 78.9 | 87.9 | 68.5 | 80.36 |
| ELECTRA-Small-OWT (my) | 58.72 | 88.03 | 86.04 | 86.16 | 88.63 | 80.4 | 87.45 | 67.46 | 80.36 |

Table 1: Results on the GLUE dev set. The official result comes from expected results. Scores are the average scores finetuned from the same checkpoint. (See this issue) My result comes from pretraining a model from scratch and then taking the average over 10 finetuning runs for each task. Both results are trained on the OpenWebText corpus.

| Model | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ELECTRA-Small++ | 55.6 | 91.1 | 84.9 | 84.6 | 88.0 | 81.6 | 88.3 | 63.6 | 79.7 |
| ELECTRA-Small++ (my) | 54.8 | 91.6 | 84.6 | 84.2 | 88.5 | 82 | 89 | 64.7 | 79.92 |

Table 2: Results on the GLUE test set. My result finetunes the pretrained checkpoint loaded from huggingface.

[Figures: official training loss curve (left) and my training loss curve (right)]

Table 3: Both are small models trained on OpenWebText. The official one is from here. You should take the value of training loss with a grain of salt since it doesn't reflect the performance of downstream tasks.

Features of this implementation

More results

How stable is ELECTRA pretraining?

| Mean | Std | Max | Min | #models |
| --- | --- | --- | --- | --- |
| 81.38 | 0.57 | 82.23 | 80.42 | 14 |

Table 4: Statistics of GLUE dev set results for small models. Every model is pretrained from scratch with a different seed and finetuned for 10 random runs for each GLUE task. The score of a model is the average of the best of the 10 runs for each task. (The process is the same as the one described in the paper.) As we can see, although ELECTRA mimics adversarial training, it has good training stability.
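For illustration only (not the project's actual evaluation code), here is a minimal sketch of the aggregation described above, assuming per-run scores are collected into a nested dict: take the best of the 10 finetuning runs per task, average those per-task bests into one score per pretrained model, then summarize across models.

```python
import statistics

# Hypothetical input: scores[model_name][task_name] is a list of 10 finetuning-run scores.
def model_score(task_scores):
    """Average of the best run per task for one pretrained model."""
    return statistics.mean(max(runs) for runs in task_scores.values())

def summarize(scores):
    """Mean/Std/Max/Min of per-model scores, as reported in Table 4."""
    per_model = [model_score(task_scores) for task_scores in scores.values()]
    return {
        "Mean": statistics.mean(per_model),
        "Std": statistics.stdev(per_model),
        "Max": max(per_model),
        "Min": min(per_model),
        "#models": len(per_model),
    }
```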

How stable is ELECTRA finetuning on GLUE?

| Model | CoLA | SST | MRPC | STS | QQP | MNLI | QNLI | RTE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ELECTRA-Small-OWT (my) | 1.30 | 0.49 | 0.7 | 0.29 | 0.1 | 0.15 | 0.33 | 1.93 |

Table 5: Standard deviation for each task. This is the same model as in Table 1, finetuned for 10 runs per task.

Discussion

HuggingFace forum post
Fastai forum post

Usage

Note: This project is actually for my personal research, so I didn't try to make it easy to use for all users, but I did try to make it easy to read and modify.

Install requirements

pip3 install -r requirements.txt

Steps

  1. python pretrain.py
  2. Set pretrained_checkpoint in finetune.py to the checkpoint you've pretrained and saved in electra_pytorch/checkpoints/pretrain (see the sketch after this list).
  3. python finetune.py (with do_finetune set to True)
  4. Go to neptune, pick the best of the 10 runs for each task, and set th_runs in finetune.py according to the numbers in the names of the runs you picked.
  5. python finetune.py (with do_finetune set to False); this outputs predictions on the test set. You can then compress and send the .tsv files in electra_pytorch/test_outputs/<group_name>/*.tsv to the GLUE site to get the test score.
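A minimal sketch of the settings touched in steps 2–4, assuming pretrained_checkpoint, do_finetune, and th_runs are plain assignments near the top of finetune.py (the exact config layout in this repo may differ, and all values below are hypothetical):

```python
# Inside finetune.py (names taken from the steps above; exact structure is an assumption).

# Step 2: point to the checkpoint saved by pretrain.py under checkpoints/pretrain.
pretrained_checkpoint = "my_run_name_11081_100.pth"  # hypothetical file name

# Step 3: finetune 10 random runs per GLUE task.
do_finetune = True

# Step 4: after inspecting the runs on neptune, record the best run index per task.
# Keys and values here are purely illustrative.
th_runs = {"cola": 3, "sst2": 7, "mrpc": 1}

# Step 5: rerun with do_finetune = False to write test-set predictions
# to test_outputs/<group_name>/*.tsv.
```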

Notes

Advanced Details

Below are details of the original implementation/paper that are easy to overlook and that I have taken care of. I found these details indispensable to successfully replicating the results of the paper.

Optimization

File architecture

If you pretrain, finetune, and generate test results, electra_pytorch will generate these files for you.

project root
|
|── datasets
|   |── glue
|       |── <task>
|       ...
|
|── checkpoints
|   |── pretrain
|   |   |── <base_run_name>_<seed>_<percent>.pth
|   |    ...
|   |
|   |── glue
|       |── <group_name>_<task>_<ith_run>.pth
|       ...
|
|── test_outputs
|   |── <group_name>
|   |   |── CoLA.tsv
|   |   ...
|   | 
|   | ...

Citation

Original paper

@inproceedings{clark2020electra,
  title = {{ELECTRA}: Pre-training Text Encoders as Discriminators Rather Than Generators},
  author = {Kevin Clark and Minh-Thang Luong and Quoc V. Le and Christopher D. Manning},
  booktitle = {ICLR},
  year = {2020},
  url = {https://openreview.net/pdf?id=r1xMH1BtvB}
}

This implementation

@misc{electra_pytorch,
  author = {Richard Wang},
  title = {PyTorch implementation of ELECTRA},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/richarddwang/electra_pytorch}}
}