SST-2 accuracy is 50% after finetuning

richarddwang / electra_pytorch

Pretrain and finetune ELECTRA with fastai and huggingface. (Results of the paper replicated !)

324 stars 41 forks source link

SST-2 accuracy is 50% after finetuning #34

Closed xiamengzhou closed 2 years ago

xiamengzhou commented 2 years ago

Hi Richard,

Thanks for providing the implementation to pretrain ELECTRA in pytorch!

I tried pre-training an ELECTRA-small model with wikipedia data and selected the 25% trained model (updated for 250k steps) to fine-tune on SST-2. At the end of each epoch, the validation accuracy of SST-2 is the same around 50%. It seems to be an optimization issue if the accuracy stays the same throughout the whole training process. Do you have any idea about why it happens? Thank you!!

richarddwang commented 2 years ago

Some random thoughts:

Have you tried finetuning on other tasks (e.g. CoLA)?
Is the pretrained model succefully loaded instead of using a randomly initialized model when finetuning?
Did you enable logging for pretraining? Does learning rate or loss curve look reasonable?

xiamengzhou commented 2 years ago

Thanks for your quick response!

I haven't finetuned on other tasks but will try
Yes, the model is successfully loaded. A random initialized model can achieve around 80% accuracy on SST-2 but my model only achieves 50% accuracy :(
The loss curve is shown below and it looks normal to me:

I will try to rerun the model with openwebtext again. Feel free to close the issue if you couldn't spot any obvious issues. Thanks again for your help!

ghost commented 2 years ago

I think that I am having the same problem: I pre-trained a small model on a different language and then fine-tuned it on a NER task. The random model achieves ~80% while the final pre-trained model achieves ~20%. However, I am not able to faithfully reproduce the error. There seems to be a critical point during the pre-training, after which this behavior occurs. When I evaluate early models (e.g. after 0.0001% or after 0.1%), the model performs similar to the random model. I even got a model after 126 000 steps that achieves 90%, so this model is clearly better than the random model. However, later checkpoints achieve only 20%.

My loss curve looks similar to xiamengzhous: loss

Help would be greatly appreciated! Thanks in advance.

xiamengzhou commented 2 years ago

It seems that the model loss reaches nan at some point if I use the fp16 setting. With full precision, the model is properly trained.

ghost commented 2 years ago

Thank you very much. I tried it simply uncommenting line 417 in the pre-training script. However, the results stayed the same. Could you please elaborate whether I have to change something else? Thanks in advance.

richarddwang commented 2 years ago

Since the pretraining loss didn't reach nan,so you may want to try to comment out line 423 in finetune.py.

ghost commented 2 years ago

Thank you @richarddwang for the suggestion. Unfortunately, I don't use your fine-tuning script, so the error cannot come from there. But your answer suggests that the loss should be NaN if this was the problem? I thought so, but the fact that the loss of @xiamengzhou looks so normal made me uncertain.

richarddwang commented 2 years ago

From my experience, if pretraining loss didn't get nan, than there won't be any problem with finetuning. (but xiameng seems to have encountered with some problems that can be solved by disabling mixed precision)

My recommendation would be testing the pretrained checkpoint with my finetuning script to know where the problem resides in.

If you got reasonable results with my finetuning script, you may want to compare it to your script and see what could be the problem.

ghost commented 2 years ago

Thank you for your suggestion. I will try your scripts out and see whether I get the same error.

When analyzing the wandb stats, I realized, that the gradients of the discriminator look very strange: wandb

This behavior occurs in all gradient plots of the discriminator, and I find it tempting to connect it to the problem. However, I will now try your scripts and report my observations afterwards.

richarddwang commented 2 years ago

Hi, if you still need help. Please open another issue.

For the following people. The summay is:

If you find loss reached nan, try to disable fp16
If you write your own finetuning script and it didn't work, please make sure the finetuning script written here gets you reasonable results first.