Aurora-zgz opened this issue 1 week ago
I checked the output log and it seems to be caused by overfitting. So, how can I adjust the training file to achieve the results in the paper?
Hi @Aurora-zgz, thank you for the question. Have you applied early stopping? We choose the checkpoint that achieves the lowest mean absolute error on the validation set to prevent overfitting (so we do not end up training for the full 1000 epochs).
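For reference, here is a minimal sketch of that checkpoint-selection strategy, assuming a standard PyTorch training loop; `train_one_epoch`, `evaluate`, and the file name are illustrative placeholders, not the exact CounTX code:

```python
import torch

best_val_mae = float("inf")
epochs_without_improvement = 0
patience = 50  # illustrative: stop if validation MAE has not improved for 50 epochs

for epoch in range(1000):
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    val_mae = evaluate(model, val_loader)            # hypothetical helper returning validation MAE

    if val_mae < best_val_mae:
        best_val_mae = val_mae
        epochs_without_improvement = 0
        # keep only the checkpoint with the lowest validation MAE
        torch.save(model.state_dict(), "best_checkpoint.pth")
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # early stopping

# report test MAE/RMSE with the best checkpoint, not the last epoch
model.load_state_dict(torch.load("best_checkpoint.pth"))
```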
We also used a batch size of 8. Have you reproduced the original results without distributed training first?
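One thing to keep in mind about batch size under distributed training: with PyTorch DistributedDataParallel, the `batch_size` given to each DataLoader is per process, so the effective batch size is that value times the number of GPUs. Below is a minimal sketch of a typical DDP setup, assuming a `torchrun` launch; `build_model` and `train_dataset` are illustrative placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# assumes launch via: torchrun --nproc_per_node=<num_gpus> train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)        # hypothetical model constructor
model = DDP(model, device_ids=[local_rank])

# batch_size here is per GPU: with 2 GPUs and batch_size=8,
# the effective (global) batch size is 16, not 8.
sampler = DistributedSampler(train_dataset)   # hypothetical dataset object
train_loader = DataLoader(train_dataset, batch_size=8, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle consistently across processes each epoch
    ...
```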
Here are the training logs from one of the original CounTX training runs: https://api.wandb.ai/links/niki-oxford/zzs0ewhf. I used Weights & Biases to log the results. The validation set MAE stopped improving after 535/1000 epochs.
Thank you for your detailed answer. Due to limited resources, I had only trained for 200 epochs on a single GPU before, so I switched to distributed training to speed things up, and that may be where the issues came from. Next, I will first reproduce the results on a single GPU, and then try early stopping and batch_size=8 when I move to distributed training.
Great, please feel free to continue to ask questions, and I will do my best to help!
Hello, I have some questions to ask you. I am trying to convert the training file to distributed training, but after running 1000 epochs with batch_size=16 the result was very poor: Test MAE: 44.09, Test RMSE: 117.41. Perhaps an error occurred while I was modifying the training file? Could you please take some time to help check the issue? Thank you very much. The modified content is as follows: