Closed: MichaelMonashev closed this issue 3 years ago
@MichaelMonashev I'm probably not going to be able to help you there, since you're using your own (or modified) training code and not any of the public datasets. I've trained all of them with good results. The only nans tend to show up at the beginning, when the LR is too aggressive, the wrong optimizer is used, or batch sizes are too small. Late-epoch nans are usually indicative of a different sort of problem.
You might want to try enabling torch.autograd.detect_anomaly() and start hunting for the first origin of the nan...
I don't think there is likely to be an actual issue with the model here. If a reproduction can be made with the training code in this repo, I'll look at it. For these models, the focal loss specifically isn't the most stable, and the official impl has plenty of open questions about nan issues as well. It's best to stick within the range of recommended learning rates, optimizers, etc. unless you know how to tune / debug those hparams.
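For reference, a minimal sketch of what enabling anomaly detection can look like; the tiny model, optimizer, and data here are placeholders for illustration, not code from this repo:

```python
import torch
import torch.nn as nn

# Enable anomaly detection globally; any backward pass that produces nan/inf
# raises an error pointing at the forward op that created it.
torch.autograd.set_detect_anomaly(True)

model = nn.Linear(10, 1)                       # stand-in for the real detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10)
target = torch.randn(4, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()                                # would raise here if a nan/inf originated upstream
optimizer.step()
```

Note that anomaly detection slows training noticeably, so it's best used only while hunting for the offending op.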
@rwightman, it turned out I had hardware problems. I changed GPUs and am testing the training code now. After some hours I have not seen nans in the loss.
I am training efficientdet and getting `nan` in the loss after some epochs. I restart training, train for some epochs, and get `nan` in the loss again. Now I constantly get `nan` in the loss at the 86th epoch and cannot go any further. I am using gradient clipping:
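(The original clipping snippet is not included above; purely as an illustration, a typical gradient-clipping step in PyTorch looks like the sketch below, where the model, optimizer, and `max_norm=10.0` value are arbitrary assumptions, not the settings used in this issue.)

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # stand-in for the efficientdet model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 10)
target = torch.randn(4, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
# Clip the global gradient norm before the optimizer step; 10.0 is an
# arbitrary example value.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
optimizer.step()
```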
My train log:
My summary log: