matyasbohacek / spoter

Repository accompanying the "Sign Pose-based Transformer for Word-level Sign Language Recognition" paper
https://spoter.signlanguagerecognition.com
Apache License 2.0
78 stars 24 forks

Train loss became NaN and the accuracy won't change after a couple of epochs #15

Open JChaloton opened 5 days ago

JChaloton commented 5 days ago

Hello author. Thank you for your work on this project. When I run the model with the provided dataset, the accuracy barely changes. By the fourth epoch, the TRAIN loss becomes NaN and the accuracy stops improving.

[1] TRAIN loss: 4.760449286472781 acc: 0.0062413314840499305 [1] VALIDATION acc: 0.008902077151335312

[2] TRAIN loss: 4.682462440945735 acc: 0.010402219140083218 [2] VALIDATION acc: 0.011869436201780416

[3] TRAIN loss: 4.676414796614944 acc: 0.012482662968099861 [3] VALIDATION acc: 0.011869436201780416

[4] TRAIN loss: nan acc: 0.012482662968099861 [4] VALIDATION acc: 0.017804154302670624

[5] TRAIN loss: nan acc: 0.020804438280166437 [5] VALIDATION acc: 0.017804154302670624

These are my training parameters: python -m train --experiment_name WLASL_test --epochs 30 --lr 0.0001 --num_classes 100 --training_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_train_25fps.csv --validation_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_val_25fps.csv --testing_set_path /home/jirapong/spoter/spoter/datasets/WLASL100_test_25fps.csv

These are the test loss and learning rate graphs: [image: WLASL_test_lr] [image: WLASL_test_loss]

May I ask what would cause this?

matyasbohacek commented 4 days ago

Hi there! If your training encounters a NaN loss, a good first step is to try a range of different learning rates. In particular, test different magnitudes (e.g., if your current learning rate is $1e-4$, you might want to try $1e-3$ and $1e-5$). Keep in mind that this advice applies to any deep learning training scenario—not just in this case. Personally, I haven’t encountered this issue with SPOTER. Let me know if that helps!

JChaloton commented 51 minutes ago

Hello author! Sorry for the late reply. I have tried learning rates ranging from 1e-3 to 1e-5, and after a couple of epochs the loss still becomes NaN. I am using the preprocessed data provided in the dataset section and am unsure whether that could be the problem. I am a little lost right now. Do you have any suggestions?

matyasbohacek commented 39 minutes ago

I'd encourage you to debug the inputs and outputs of the model. If you're running locally, you can inspect the input and output tensors in your debugger; if you're running remotely, you can print them. That might uncover issues you're currently not aware of.
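As a rough illustration of that kind of check, here is a minimal sketch that counts NaN/Inf entries in a batch once it has been converted to nested Python lists. The helper names (`count_non_finite`, `report`) and the example values are hypothetical and are not part of the SPOTER code; the sketch only assumes the tensors can be turned into lists, for instance with `tensor.tolist()`.

```python
import math


def count_non_finite(values):
    """Count NaN/Inf entries in a number or an arbitrarily nested list of numbers."""
    if isinstance(values, (list, tuple)):
        return sum(count_non_finite(v) for v in values)
    return 0 if math.isfinite(values) else 1


def report(name, values):
    """Print and return the number of non-finite entries found under a label."""
    bad = count_non_finite(values)
    print(f"{name}: {bad} non-finite value(s)")
    return bad


# Hypothetical stand-ins for one batch; in a real run these would be
# inputs.tolist() and outputs.tolist() taken inside the training loop.
example_inputs = [[0.12, 0.50], [0.33, float("nan")]]
example_outputs = [[1.7, -0.4], [float("inf"), 0.2]]

report("inputs", example_inputs)
report("outputs", example_outputs)
```

If the inputs are already non-finite, the data is the first suspect; if only the outputs are, the problem more likely arises inside the model or the optimization. In PyTorch, `torch.isfinite(tensor).all()` performs the same check directly on a tensor.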

matyasbohacek commented 38 minutes ago

Feel free to post those logs here if they don't make the issue clear to you.
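One kind of log that could be posted is a scan of the dataset CSV itself, since the thread leaves open whether the preprocessed data is at fault. The sketch below is only an assumption about how such a scan might look: it treats every cell as text, pulls out anything that parses as a number (including numbers inside list-like strings), and reports the non-finite ones. The column names in the sample are made up and may not match the real files.

```python
import csv
import io
import math
import re


def scan_csv(handle):
    """Yield (row_number, column, token) for every non-finite number found."""
    reader = csv.DictReader(handle)
    for row_number, row in enumerate(reader, start=1):
        for column, cell in row.items():
            if not isinstance(cell, str):
                continue  # missing or surplus fields are skipped
            # Split on whitespace, commas and brackets so list-like cells work too.
            for token in re.split(r"[\s,\[\]()]+", cell):
                try:
                    value = float(token)
                except ValueError:
                    continue  # not a number (e.g. a text label)
                if not math.isfinite(value):
                    yield row_number, column, token


# Hypothetical sample with made-up column names, standing in for a real file.
sample = io.StringIO(
    'label,x_values,y_values\n'
    '3,"[0.1, 0.2]","[0.5, nan]"\n'
    '7,"[0.3, inf]","[0.6, 0.7]"\n'
)
problems = list(scan_csv(sample))
for row_number, column, token in problems:
    print(f"row {row_number}, column {column}: {token}")
```

To run it on a real file, open the path passed as `--training_set_path` with `open(path, newline="")` and hand the file object to `scan_csv`. An empty result would suggest the NaN originates during training and not in the stored data.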