jackyjsy / CVPR21Chal-SLR

This repo contains the official code of our work SAM-SLR which won the CVPR 2021 Challenge on Large Scale Signer Independent Isolated Sign Language Recognition.

Loss does not decrease, and the model is not trained #7

Closed LiangSiyv closed 3 years ago

LiangSiyv commented 3 years ago

Hi, congratulations on winning the competition. I deployed your code locally, but after many epochs of training the model still behaves as if it has not been trained at all: the predictions remain random, and the accuracy only reaches around chance level (1/226).

The only modifications I made to your code were setting the batch size to 2 due to resource constraints, and resizing the 512×512 data directly to 256×256 during preprocessing.

No other changes were made.

Do you have any idea what might be causing my problem?


[screenshot attached: 捕获.PNG]

jackyjsy commented 3 years ago

Hi Siyu, thanks for your interest, and I am happy to help. I assume you are talking about training the Conv3D model, right? The batch size needs to be large for the model to converge. We tried a smaller batch size of 8 in a previous experiment, and the model did not converge well. I believe the reason lies in the sampling strategy and data augmentation techniques used in the data loader.

For example, when we sample 32 consecutive frames from the videos (length: 50 to 150 frames), they may not contain enough information to optimize the model, especially when the batch size is extremely small.
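As a rough illustration (this is a hedged sketch, not the repo's exact data loader), sampling a window of consecutive frames from a variable-length clip can look like this:

```python
import random

def sample_consecutive_frames(num_frames, clip_len=32):
    """Pick `clip_len` consecutive frame indices from a video of
    `num_frames` frames, repeating frames if the video is shorter."""
    if num_frames >= clip_len:
        start = random.randint(0, num_frames - clip_len)
        return list(range(start, start + clip_len))
    # Video shorter than the clip window: loop over it to fill.
    return [i % num_frames for i in range(clip_len)]

# A 50-frame video yields a 32-frame window covering ~60% of the clip,
# so a single sample can miss a large part of the sign.
idxs = sample_consecutive_frames(50)
```

With clips of 50 to 150 frames, one 32-frame window may cover as little as a fifth of the video, which is why small batches give the optimizer so little signal per step.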

If your hardware limits the batch size, I recommend increasing it to at least 12 by sampling fewer frames (e.g., 16), choosing a smaller model capacity, or replacing label smoothing with plain cross-entropy loss.

LiangSiyv commented 3 years ago

Thanks for your reply! I understand the model may not converge because of the extremely small batch size, but the accuracy stays at 0%, as if the model were not being trained at all.
I will try your advice and report back in this issue. Thank you again!

snorlaxse commented 3 years ago

Since the hardware configuration limits the batch size, I set the batch size to 16 while sampling 32 frames. However, the accuracy on the test set is only 86.99%, which is far below the 97.81% achieved by the provided 'final_models_finetuned/rgb_final_finetuned.pth'.

No other changes were made.

Do you have any idea what might be causing my problem?