JChaloton closed this issue 2 weeks ago
Hi there! If your training encounters a NaN loss, a good first step is to try a range of learning rates. In particular, test different orders of magnitude (e.g., if your current learning rate is $1e-4$, try $1e-3$ and $1e-5$). Keep in mind that this advice applies to any deep learning training scenario, not just this one. Personally, I haven't encountered this issue with SPOTER. Let me know if that helps!
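For anyone who wants to script the sweep suggested above, here is a minimal sketch. The helper name `lr_sweep` is made up for illustration and is not part of SPOTER. It only generates the candidate values, which would then be passed to the training script's `--lr` flag one run at a time.

```python
def lr_sweep(base_lr, decades=1):
    """Return learning rates spaced one order of magnitude apart around base_lr."""
    return [base_lr * 10.0 ** k for k in range(-decades, decades + 1)]


# Hypothetical usage: print one --lr value per training run to try.
for lr in lr_sweep(1e-4):
    print(f"--lr {lr:g}")
```

Widening `decades` to 2 would cover a broader range at the cost of more runs.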
Hello author! Sorry for the late reply. I have tried learning rates ranging from 1e-3 to 1e-5, and in each case the loss becomes NaN after a couple of epochs. I am using the preprocessed data provided in the dataset section and am unsure whether that could be the problem. I am a little lost right now. Do you have any suggestions?
I'd encourage you to debug the inputs and outputs of the model. If you're running locally, you can inspect the input and output tensors in your debugger; if you're running remotely, you can print them. That might uncover issues you're currently not aware of.
Feel free to post those logs here if they don't make the issue clear to you.
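As a rough illustration of the kind of check meant here, the sketch below scans a flat list of numbers for NaN or infinite entries using only the standard library. The function names and the sample values are hypothetical. With PyTorch tensors, `torch.isfinite(t).all()` performs the equivalent check directly on the tensor.

```python
import math


def find_non_finite(values):
    """Return (index, value) pairs for every NaN or infinite entry."""
    return [(i, v) for i, v in enumerate(values) if not math.isfinite(v)]


def check_batch(name, values):
    """Print a short report for one flattened tensor; return True if it is clean."""
    bad = find_non_finite(values)
    if bad:
        print(f"{name}: {len(bad)} non-finite value(s), first at index {bad[0][0]}")
    return not bad


# Hypothetical flattened values standing in for a model input and output.
check_batch("inputs", [0.12, 0.57, float("nan"), 0.33])
check_batch("outputs", [1.5, -0.2, 0.0])
```

Running such a check on both the inputs and the outputs of each batch helps narrow down whether the NaN originates in the data or inside the model.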
I did a clean install of Python and the dependencies. I may have updated some of the dependencies earlier, which caused conflicts within the code. It finally works now.
May I ask for some clarification about the accuracy score? Does this log line
[350] TRAIN loss: 0.06932155699075178 acc: 0.9778085991678225 [350] VALIDATION acc: 0.543026706231454
mean that the training accuracy is 97.78% and the validation accuracy is 54.3%?
Looking forward to your response!
Awesome. And yes, that is right! If you have any other questions that don't relate to this NaN loss issue, feel free to open a new thread. Otherwise, we can close this thread to let folks know you managed to solve the issue.
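To make that reading of the log line concrete, here is a small sketch that parses a line in the format shown in this thread and prints the accuracies as percentages. The regular expression and the `parse_log_line` helper are assumptions based only on the lines posted here, not on SPOTER's logging code.

```python
import re

# Pattern inferred from the log lines posted in this thread (an assumption).
LOG_PATTERN = re.compile(
    r"\[(\d+)\] TRAIN loss: (\S+) acc: (\S+) \[\d+\] VALIDATION acc: (\S+)"
)


def parse_log_line(line):
    """Extract epoch, loss, and accuracies from one log line, or return None."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    epoch, loss, train_acc, val_acc = match.groups()
    return {
        "epoch": int(epoch),
        "loss": float(loss),  # float("nan") is parsed as NaN
        "train_acc": float(train_acc),
        "val_acc": float(val_acc),
    }


line = (
    "[350] TRAIN loss: 0.06932155699075178 acc: 0.9778085991678225 "
    "[350] VALIDATION acc: 0.543026706231454"
)
stats = parse_log_line(line)
print(f"train: {stats['train_acc']:.2%}, validation: {stats['val_acc']:.2%}")
# prints "train: 97.78%, validation: 54.30%"
```

The accuracies are logged as fractions between 0 and 1, so multiplying by 100 gives the percentages quoted in the question.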
Thank you very much! I will be closing this issue to let others know it has been resolved.
Just a little reminder to myself and other beginners in this field: the versions of Python and the dependencies are very important. Please follow the specified versions strictly!
Thank you again, author, for your work.
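For readers who want to verify their environment before training, the following sketch compares installed package versions against a set of pins. It assumes Python 3.8 or newer (for `importlib.metadata`). The package name and version in the example are placeholders, and the real pins should come from the project's own dependency list.

```python
from importlib import metadata


def check_versions(pinned):
    """Compare installed package versions against pinned ones.

    Returns {name: (expected, installed_or_None)} for every mismatch.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Placeholder pin for illustration only; replace with the project's real pins.
pinned = {"example-package": "1.2.3"}
for name, (expected, installed) in check_versions(pinned).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is installed at exactly the expected version.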
Hello author. Thank you for your work on this project. When I run the model with the provided dataset, the accuracy doesn't seem to change. From the fourth epoch on, the TRAIN loss becomes NaN and the accuracy stops improving.
[1] TRAIN loss: 4.760449286472781 acc: 0.0062413314840499305 [1] VALIDATION acc: 0.008902077151335312
[2] TRAIN loss: 4.682462440945735 acc: 0.010402219140083218 [2] VALIDATION acc: 0.011869436201780416
[3] TRAIN loss: 4.676414796614944 acc: 0.012482662968099861 [3] VALIDATION acc: 0.011869436201780416
[4] TRAIN loss: nan acc: 0.012482662968099861 [4] VALIDATION acc: 0.017804154302670624
[5] TRAIN loss: nan acc: 0.020804438280166437 [5] VALIDATION acc: 0.017804154302670624
This is my training command:
python -m train --experiment_name WLASL_test --epochs 30 --lr 0.0001 --num_classes 100 --training_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_train_25fps.csv --validation_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_val_25fps.csv --testing_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_test_25fps.csv
These are the test loss and learning rate graphs:
May I ask what would cause this?