Strangely, under the referit_bert condition, my reproduced result is only 66.3. The only change I made was forced by insufficient GPU memory: I used gradient accumulation with batch size 32 to simulate a batch size of 64. Moreover, the 66.3 was only reached at epoch 150; at epoch 100 it was still below 65.7. My environment is CUDA 11.6, PyTorch 11.3, and BERT 0.6.2. I wonder if you have any guesses about possible reasons for this. It does not seem to be a problem with the module's performance, but rather some other detail I am missing.
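For reference, this is a minimal sketch of the accumulation loop I mean. The model, optimizer, and loader here are toy placeholders, not the repo's actual code:

```python
import torch
from torch import nn

# Toy stand-ins -- in my run these are the repo's model and data loader.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(4)]
criterion = nn.MSELoss()

ACCUM_STEPS = 2  # two micro-batches of 32 ~ one batch of 64

model.train()
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets)
    # Divide by the accumulation count so the summed gradients
    # match the average over a true batch of 64.
    (loss / ACCUM_STEPS).backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```

One thing I'm aware this cannot reproduce exactly is BatchNorm behavior: statistics are still computed over micro-batches of 32, which differs from a true batch size of 64, though I'm not sure that alone explains the gap.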