matyasbohacek / spoter

Repository accompanying the "Sign Pose-based Transformer for Word-level Sign Language Recognition" paper
https://spoter.signlanguagerecognition.com
Apache License 2.0

Train loss became NaN and the accuracy won't change after a couple of epochs #15

Closed JChaloton closed 2 weeks ago

JChaloton commented 3 weeks ago

Hello author, thank you for your work on this project. I tried to run the model with the provided dataset, but the accuracy barely changes. After the fourth epoch, the TRAIN loss becomes NaN and the accuracy stops improving.

[1] TRAIN loss: 4.760449286472781 acc: 0.0062413314840499305 [1] VALIDATION acc: 0.008902077151335312

[2] TRAIN loss: 4.682462440945735 acc: 0.010402219140083218 [2] VALIDATION acc: 0.011869436201780416

[3] TRAIN loss: 4.676414796614944 acc: 0.012482662968099861 [3] VALIDATION acc: 0.011869436201780416

[4] TRAIN loss: nan acc: 0.012482662968099861 [4] VALIDATION acc: 0.017804154302670624

[5] TRAIN loss: nan acc: 0.020804438280166437 [5] VALIDATION acc: 0.017804154302670624

These are my training parameters: python -m train --experiment_name WLASL_test --epochs 30 --lr 0.0001 --num_classes 100 --training_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_train_25fps.csv --validation_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_val_25fps.csv --testing_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_test_25fps.csv

These are the learning rate and test loss graphs (attached figures: WLASL_test_lr, WLASL_test_loss).

May I ask what would cause this?

matyasbohacek commented 3 weeks ago

Hi there! If your training encounters a NaN loss, a good first step is to try a range of learning rates. In particular, test different orders of magnitude (e.g., if your current learning rate is 1e-4, try 1e-3 and 1e-5). Keep in mind that this advice applies to any deep learning training scenario, not just this case. Personally, I haven't encountered this issue with SPOTER. Let me know if that helps!
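As a rough illustration (not part of the repository; the arguments mirror the command you quoted above, and the dataset paths are placeholders you'd need to adjust), such a sweep could look like this:

```python
import subprocess

# Hypothetical sweep over learning-rate magnitudes; each run launches the
# training script with the same arguments as above, varying only --lr.
for lr in ["1e-3", "1e-4", "1e-5"]:
    subprocess.run(
        [
            "python", "-m", "train",
            "--experiment_name", f"WLASL_lr_{lr}",
            "--epochs", "30",
            "--lr", lr,
            "--num_classes", "100",
            "--training_set_path", "datasets/WLASL100_train_25fps.csv",
            "--validation_set_path", "datasets/WLASL100_val_25fps.csv",
            "--testing_set_path", "datasets/WLASL100_test_25fps.csv",
        ],
        check=True,
    )
```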

JChaloton commented 2 weeks ago

Hello author! Sorry for the late reply. I have tried learning rates ranging from 1e-3 to 1e-5, and after a couple of epochs the loss still turns to NaN. I am using the preprocessed data provided in the dataset section and am unsure whether that could be the problem. I am a little lost right now. Do you have any suggestions?

matyasbohacek commented 2 weeks ago

I'd encourage you to debug the model's inputs and outputs. If you're running locally, you can inspect the input and output tensors in your debugger; if you're running remotely, you can print them. That might uncover issues you're currently not aware of.
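For example, a quick check along these lines can flag the first batch where something goes non-finite (just a sketch: `model`, `criterion`, `inputs`, and `labels` stand for whatever your training loop already defines):

```python
import torch

# Minimal sketch: report the first batch where inputs, outputs, or the loss
# stop being finite. The arguments are placeholders for your own loop.
def check_batch(model, criterion, inputs, labels, step):
    if not torch.isfinite(inputs).all():
        print(f"[step {step}] non-finite values in the input tensor")
    outputs = model(inputs)
    if not torch.isfinite(outputs).all():
        print(f"[step {step}] non-finite values in the model output")
    loss = criterion(outputs, labels)
    if not torch.isfinite(loss):
        print(f"[step {step}] non-finite loss: {loss.item()}")
    return outputs, loss
```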

matyasbohacek commented 2 weeks ago

Feel free to post those logs here if they don't make the issue clear to you.

JChaloton commented 2 weeks ago

I did a strict fresh install of Python and the dependencies. I had previously updated some of the dependencies, which caused conflicts within the code. I finally got it to work now. May I ask you for some clarification about the accuracy score? Does [350] TRAIN loss: 0.06932155699075178 acc: 0.9778085991678225 [350] VALIDATION acc: 0.543026706231454 mean the training accuracy is 97.78% and the validation accuracy is 54.3%?

(attached figure: test_loss)

Looking forward to your response!

matyasbohacek commented 2 weeks ago

Awesome. And yes, that is right! If you have any other questions that don't relate to this NaN loss issue, feel free to open a new thread. Otherwise, we can close this thread to let folks know you managed to solve the issue.

JChaloton commented 2 weeks ago

Thank you very much! I will be closing this issue to let others know it has been resolved.

Just a little reminder to myself and other beginners in this field: the Python and dependency versions matter a lot. Please follow the specified versions strictly!
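For instance, a quick check like the one below helped me (just a sketch; it assumes a requirements.txt with pinned package==version lines in the repository root):

```python
import sys
from importlib.metadata import version, PackageNotFoundError

# Minimal sketch: compare installed packages against pinned
# "package==version" lines in requirements.txt (the path is an assumption).
print("Python:", sys.version.split()[0])
with open("requirements.txt") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, pinned = line.split("==", 1)
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = "not installed"
        status = "OK" if installed == pinned else "MISMATCH"
        print(f"{name}: pinned {pinned}, installed {installed} -> {status}")
```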

Thank you again, author, for your work.