Closed: xiamengzhou closed this issue 2 years ago
Some random thoughts:
Thanks for your quick response!
I will try to rerun the model with openwebtext again. Feel free to close the issue if you couldn't spot any obvious issues. Thanks again for your help!
I think I am having the same problem: I pre-trained a small model on a different language and then fine-tuned it on a NER task. The random model achieves ~80% while the final pre-trained model achieves only ~20%. However, I am not able to faithfully reproduce the error. There seems to be a critical point during pre-training after which this behavior occurs. When I evaluate early checkpoints (e.g. after 0.0001% or 0.1% of training), the model performs similarly to the random model. I even got a model after 126,000 steps that achieves 90%, so that model is clearly better than the random one. However, later checkpoints achieve only 20%.
My loss curve looks similar to xiamengzhou's:
Help would be greatly appreciated! Thanks in advance.
It seems that the model loss reaches NaN at some point if I use the fp16 setting. With full precision, the model trains properly.
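Disabling fp16 is one fix; another guard is to stop (or skip the step) the moment the loss stops being finite, so a NaN can't silently poison later checkpoints. This is a minimal sketch with a placeholder model and data, not the repo's actual training loop:

```python
# Sketch: guard a training loop against a loss that silently turns NaN.
# The model, data, and hyperparameters here are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(8, 2)            # stand-in for the real network
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(4, 8)
    y = torch.randint(0, 2, (4,))
    loss = loss_fn(model(x), y)
    if not torch.isfinite(loss):
        # Bail out instead of training on garbage gradients.
        raise RuntimeError(f"non-finite loss at step {step}: {loss.item()}")
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())
```

With a check like this, an fp16 overflow would surface immediately at the offending step instead of showing up much later as a checkpoint that fine-tunes at chance level.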
Thank you very much. I tried simply uncommenting line 417 in the pre-training script. However, the results stayed the same. Could you please elaborate on whether I have to change something else? Thanks in advance.
Since the pretraining loss didn't reach NaN, you may want to try commenting out line 423 in finetune.py.
Thank you @richarddwang for the suggestion. Unfortunately, I don't use your fine-tuning script, so the error cannot come from there. But your answer suggests that the loss would be NaN if this were the problem? I thought so too, but the fact that @xiamengzhou's loss curve looks so normal made me uncertain.
From my experience, if the pretraining loss didn't reach NaN, then there won't be any problem with finetuning. (But xiamengzhou seems to have encountered some problems that could be solved by disabling mixed precision.)
My recommendation would be to test the pretrained checkpoint with my finetuning script to find out where the problem resides.
If you get reasonable results with my finetuning script, you may want to compare it against your script to see what the problem could be.
Thank you for your suggestion. I will try your scripts out and see whether I get the same error.
When analyzing the wandb stats, I realized that the gradients of the discriminator look very strange:
This behavior occurs in all of the discriminator's gradient plots, and it is tempting to connect it to the problem. However, I will now try your scripts and report my observations afterwards.
Hi, if you still need help, please open another issue.
For future readers, the summary is: if the loss reaches NaN during fp16 (mixed-precision) pre-training, the resulting checkpoints fine-tune poorly; training with full precision avoids the problem.
Hi Richard,
Thanks for providing the implementation to pretrain ELECTRA in pytorch!
I tried pre-training an ELECTRA-small model on Wikipedia data and selected the 25%-trained model (250k update steps) to fine-tune on SST-2. At the end of each epoch, the validation accuracy on SST-2 stays around 50%. If the accuracy never changes throughout training, it seems to be an optimization issue. Do you have any idea why this happens? Thank you!!
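A quick first check when validation accuracy pins to chance (~50% on the roughly balanced SST-2) is whether the classifier has collapsed to predicting a single class. A minimal sketch with a placeholder head and random data, not the actual fine-tuning code:

```python
# Sketch: count predicted classes on a validation batch to detect collapse.
# The linear head and random inputs are placeholders for the real setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
clf = nn.Linear(16, 2)              # stand-in for the SST-2 classification head
logits = clf(torch.randn(64, 16))   # stand-in for one validation batch
preds = logits.argmax(dim=-1)

counts = torch.bincount(preds, minlength=2)
print("predicted class counts:", counts.tolist())
# If one class receives (nearly) all predictions, accuracy is pinned to the
# label prior regardless of how long you fine-tune, which matches a flat ~50%.
```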