YOLOF NAN and loss fluctuations

open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark

https://mmdetection.readthedocs.io

YOLOF NAN and loss fluctuations #11537

Open sparshgarg23 opened 7 months ago

sparshgarg23 commented 7 months ago

I am training the YOLOF model, and I noticed the README mentions that the current model has some instabilities, which result in NaN and loss fluctuation issues.

So I wanted to know why the NaN issue occurs here. I had earlier trained SOLOv2 and FCOS and didn't notice any NaN issues with those models.

Is it because of one of the following?

1. The parameters chosen for model training cause the gradients to explode.
2. Recent MMEngine/MMCV updates.
3. The NaNs come from how the model is implemented and should be expected with all vision transformer models.
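One generic way to test the first hypothesis (exploding gradients) is to log the global gradient norm each step, clip it, and stop as soon as it becomes non-finite. The following is a minimal pure-Python sketch of that idea, not mmdetection's implementation; the function names `global_grad_norm` and `clip_grads` and the threshold used below are hypothetical.

```python
import math


def global_grad_norm(grads):
    """Return the L2 norm over a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def clip_grads(grads, max_norm):
    """Scale gradients so their global L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm). Raises ValueError when the
    norm is NaN or inf, because rescaling cannot repair non-finite values.
    """
    norm = global_grad_norm(grads)
    if not math.isfinite(norm):
        raise ValueError(f"non-finite gradient norm: {norm}")
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


# Hypothetical gradients and threshold, for illustration only.
clipped, norm = clip_grads([3.0, 4.0], max_norm=1.0)
print(norm, clipped)
```

If the logged norm climbs sharply just before the loss turns into NaN, the training parameters are the more likely cause. In MMEngine-style configs, clipping is normally switched on through the `clip_grad` entry of `optim_wrapper` rather than written by hand; a suitable `max_norm` for YOLOF is something to find by experiment.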

sparshgarg23 commented 7 months ago

So, a bit of an update. I tried following the changes mentioned in this link: https://github.com/thisisi3/Paddle-YOLOF/issues/1#issuecomment-1115545926. However, that doesn't fix the NaN problem.

Making sure that the scheduler slowly decreases the learning rate to 0.084 rather than 0.1 helps avoid the NaN problem, but it affects the learning process. After one epoch, the overall mAP comes out to be (drum roll) 0. Unfortunately, I don't think that even decreasing the learning rate helps, as the mAP and AR for small, medium, and large objects are all stuck at 0.
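To make the learning-rate discussion concrete, here is a small sketch of a generic linear warmup: the rate ramps from a fraction of the base value up to the base value, then stays constant. This is an illustration of the general technique and not MMEngine's scheduler code; `warmup_lr`, the step counts, and the numeric values are all hypothetical.

```python
def warmup_lr(step, base_lr, warmup_steps, start_factor):
    """Learning rate at `step` under linear warmup, then constant base_lr."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    progress = step / warmup_steps
    # Interpolate the multiplier linearly from start_factor up to 1.0.
    factor = start_factor + (1.0 - start_factor) * progress
    return base_lr * factor


# Hypothetical values, for illustration only.
for step in (0, 500, 1000, 1500):
    lr = warmup_lr(step, base_lr=0.1, warmup_steps=1500, start_factor=0.001)
    print(step, round(lr, 5))
```

Printing the schedule this way, next to the step at which the loss first becomes NaN, shows whether the failure happens during the ramp or only once the peak rate is reached.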

sparshgarg23 commented 7 months ago

Please ensure that YOLOF is working correctly. As mentioned in my earlier comments, the evaluation results are coming out as 0 even after decreasing the learning rate.

Even though the model's loss is decreasing, when I evaluate the model on test images, no results or bounding boxes are drawn. Instead, I get an error saying that evaluation couldn't be done because the entire test directory is empty.

slantingsun commented 7 months ago

Me too. I thought it was a problem with my dataset, but it's not. Adjusting the parameters is difficult.