Open JChaloton opened 5 days ago
Hi there! If your training hits a NaN loss, a good first step is to try a range of learning rates, varying the order of magnitude (e.g., if your current learning rate is 1e-4, try 1e-3 and 1e-5). This advice applies to any deep learning training run, not just this one. Personally, I haven't encountered this issue with SPOTER. Let me know if that helps!
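As a rough sketch of the sweep described above, the snippet below generates learning rates one order of magnitude either side of a base value and prints one training command per rate. The `--lr` and `--experiment_name` flags are the ones used in the command posted in this thread; the helper name `lr_candidates`, the `lr_sweep_` naming scheme, and the omission of the dataset path arguments are assumptions made for illustration only.

```python
def lr_candidates(base_lr, decades=1):
    """Return learning rates spanning `decades` orders of magnitude
    below and above `base_lr` (hypothetical helper, not part of SPOTER)."""
    return [base_lr * 10.0 ** k for k in range(-decades, decades + 1)]


def sweep_commands(base_lr, decades=1):
    """Build one command string per candidate learning rate.

    Only --experiment_name and --lr are filled in; the dataset path
    arguments from the original command still need to be appended.
    """
    commands = []
    for lr in lr_candidates(base_lr, decades):
        tag = f"{lr:g}"  # compact form such as 0.0001 or 1e-05
        commands.append(
            f"python -m train --experiment_name lr_sweep_{tag} --lr {tag}"
        )
    return commands


if __name__ == "__main__":
    for cmd in sweep_commands(1e-4):
        print(cmd)
```

If every rate in the sweep still ends in NaN, the learning rate is probably not the root cause and the inputs are worth inspecting next.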
Hello author! Sorry for the late reply. I have tried learning rates ranging from 1e-3 to 1e-5, but after a couple of epochs the loss still becomes NaN. I am using the preprocessed data provided in the dataset section and am unsure whether that could be the problem. I am a little lost right now. Do you have any suggestions?
I'd encourage you to debug the inputs and outputs of the model. If you're running locally, you can inspect the input and output tensors in your debugger; if you're running remotely, you can print them. That might uncover issues you're currently not aware of.
Feel free to post those logs here if they don't make the issue clear to you.
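One way to make that inspection systematic is to check every batch for NaN or infinite values and stop at the first offender. The sketch below is a minimal, framework-free version that works on a flat sequence of floats; the function names are hypothetical and nothing here comes from the SPOTER code base. For PyTorch tensors, `torch.isfinite(t).all()` performs the same check, and `torch.autograd.set_detect_anomaly(True)` can help locate the operation that first produces a NaN in the backward pass.

```python
import math


def first_non_finite(values):
    """Return the index of the first NaN or infinite entry in a flat
    sequence of floats, or None if every entry is finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None


def check_finite(name, values):
    """Raise ValueError naming the offending array and position.

    `name` is a free-form label such as "inputs", "outputs" or "loss".
    """
    idx = first_non_finite(values)
    if idx is not None:
        raise ValueError(
            f"{name}: non-finite value {values[idx]!r} at index {idx}"
        )


if __name__ == "__main__":
    check_finite("inputs", [0.12, 0.55, 0.98])  # passes silently
    try:
        check_finite("outputs", [0.3, float("nan"), 1.2])
    except ValueError as err:
        print(err)
```

Calling a check like this on the inputs, the model outputs, and the loss in each training step shows whether the NaN is already present in the data or only appears inside the model.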
Hello author. Thank you for your work on this project. When I run the model with the provided dataset, the accuracy barely changes. From the fourth epoch on, the TRAIN loss becomes NaN and the accuracy stops improving.
[1] TRAIN loss: 4.760449286472781 acc: 0.0062413314840499305 [1] VALIDATION acc: 0.008902077151335312
[2] TRAIN loss: 4.682462440945735 acc: 0.010402219140083218 [2] VALIDATION acc: 0.011869436201780416
[3] TRAIN loss: 4.676414796614944 acc: 0.012482662968099861 [3] VALIDATION acc: 0.011869436201780416
[4] TRAIN loss: nan acc: 0.012482662968099861 [4] VALIDATION acc: 0.017804154302670624
[5] TRAIN loss: nan acc: 0.020804438280166437 [5] VALIDATION acc: 0.017804154302670624
These are my training parameters:
python -m train --experiment_name WLASL_test --epochs 30 --lr 0.0001 --num_classes 100 --training_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_train_25fps.csv --validation_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_val_25fps.csv --testing_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_test_25fps.csv
These are the test loss and learning rate graphs:
May I ask what would cause this?