ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Loss is Nan #6907

Closed mengmeng0406 closed 2 years ago

mengmeng0406 commented 2 years ago

Search before asking

Question

When I use the VisDrone dataset to train YOLOv5, the loss becomes NaN after a few epochs and there are no predictions. Someone said it was a version problem, so I checked my CUDA version. The CUDA version of my virtual environment is indeed 10.2, but the system CUDA version is 11.4. However, this does not happen when I train with another dataset (UAVDT).

Additional

No response

glenn-jocher commented 2 years ago

@mengmeng0406 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. VisDrone trains correctly in my Google Colab test just now; I'm not able to reproduce any problems:

!python train.py --data VisDrone.yaml --epochs 100 --cache

We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

For Ultralytics to provide assistance, your code should also be:

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

jedi007 commented 2 years ago

I have the same problem, using:

python train.py --data coco128.yaml --weights yolov5n.pt --img 640 --batch-size 2 --epochs 1 --workers 0


I guess it has something to do with "with amp.autocast(enabled=cuda)". I modified this line of code to "with amp.autocast(enabled=False)" and the problem was solved.

However, GPU memory usage will be higher.

There is still a problem with the mAP, though.

jedi007 commented 2 years ago

All problems were solved by:

  1. Modify "with amp.autocast(enabled=cuda)" to "with amp.autocast(enabled=False)" (see the sketch after this list)
  2. Add "half=False" twice

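To make the first change concrete, here is a small self-contained sketch (plain PyTorch, not YOLOv5's actual train.py; the toy model and data are purely illustrative) of a training step with autocast forced off so everything runs in FP32:

# Illustrative only, not YOLOv5 code: shows autocast/GradScaler disabled.
import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
use_amp = False  # the change above: was enabled=cuda, i.e. AMP on whenever CUDA is available

model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # a no-op when disabled

x = torch.randn(4, 10, device=device)
y = torch.randn(4, 1, device=device)

with torch.cuda.amp.autocast(enabled=use_amp):  # FP32 forward pass when use_amp is False
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()  # scale() and step() pass through unchanged when disabled
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
print(float(loss))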

korhanpolat commented 2 years ago

@jedi007 thanks a lot for the solution but I'm curious:

github-actions[bot] commented 2 years ago

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

devashish-bhake commented 1 year ago

I think the statement

with amp.autocast(enabled=cuda)

can no longer be applied because of changes to the training script, and setting

half = False

doesn't seem to make a difference either, because in my case too all the losses come out as NaN.


Can anyone tell me what the possible explanation for this could be? Any kind of help is highly appreciated.

Dev-Emree commented 1 year ago

All of my loss values were NaN. First I uninstalled Anaconda, reinstalled it, deleted the old environment and created a new one, but none of that worked. The code has changed slightly compared to the earlier screenshots; I changed it as shown in the screenshots I added, and the problem was solved.

I changed line 308 to "with torch.cuda.amp.autocast(False)", then on line 354 I set "half=False".

[screenshots of the modified train.py lines]

lokeshfallen commented 1 year ago

@Dev-Emree Thanks for the solution with the changed train.py script. I made the changes, but I'm getting an error like the one below.

[screenshot of the error]

lokeshfallen commented 1 year ago

@Dev-Emree Does this have anything to do with the versions of CUDA or PyTorch?

devashish-bhake commented 1 year ago

@lokeshfallen In my case it is actually working without that error. I'm not too sure about this particular error, but it is often a kind of memory error: in many cases when you are out of memory, an error like this can pop up without any message pointing to the actual cause, which is out-of-memory. That is what I have seen and noticed; correct me if I am wrong.

PS: Please also paste the error message into your original comment as text, so that anyone who faces the same issue and stumbles upon a solution can post it here, since Google will surface this thread as a search result.

Dev-Emree commented 1 year ago

@Dev-Emree Does this have anything to do with the versions of CUDA or PyTorch?

My CUDA version is cuda_11.6.r11.6 and my PyTorch version is 1.13.1+cu116.

After I changed train.py I continued training on Colab and Kaggle. I've been getting error messages on my computer ever since YOLOv4, and there were also problems such as the internal Intel graphics processor being prioritized because I use a laptop. My advice is to go with the GPUs provided by Kaggle and Colab if you have the chance.

VandanVirani commented 1 year ago

There is an easy way: just decrease the batch size. I first tried a batch size of 32 and got NaN, with 16 I also got NaN, but with 4 or 8 I get a measurable loss.

glenn-jocher commented 1 year ago

@VandanVirani yes, reducing the batch size can be a solution for the NaN loss problem. This is because a large batch size requires more memory, and if the system does not have enough memory, it may cause NaNs to appear in the training process. Therefore, reducing the batch size reduces the memory requirement and can solve the problem. However, keep in mind that a smaller batch size may also lead to slower convergence and longer training times, so you should balance the batch size based on the available resources and training goals.
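
For reference, the batch size can be lowered directly on the command line with the --batch-size flag; for example (the dataset and weights here are just the ones mentioned earlier in this thread, so substitute your own):

python train.py --data VisDrone.yaml --weights yolov5s.pt --img 640 --batch-size 8 --epochs 100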

ErwanFagnou commented 1 year ago

Got the same issue recently, but found another solution so I am posting it here as it may help others.

All of my losses were NaN, but switching to a smaller batch size fixed only some of them; two losses were still NaN.

The solutions above suggest that AMP is the issue. Instead of modifying the source code of the training function, I set amp=False as a parameter of the train function, which resolved the issue:

model.train(data='dataset.yaml', epochs=100, imgsz=640, amp=False)
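
(Note: the model.train(..., amp=False) call above is the Python API of the ultralytics pip package rather than this repository's train.py. A slightly fuller usage sketch, where the checkpoint and dataset names are placeholders:)

from ultralytics import YOLO

model = YOLO('yolov5s.pt')  # placeholder checkpoint; use whichever model you are training
model.train(data='dataset.yaml', epochs=100, imgsz=640, amp=False)  # amp=False disables mixed precision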

glenn-jocher commented 1 year ago

@ErwanFagnou thanks for sharing your solution! It's great to hear that setting amp=False as a parameter of the train function resolved the issue for you. This is indeed an alternative approach to handling the NaN loss problem. By disabling automatic mixed precision (AMP), the training process will not use half-precision floating-point arithmetic, which can sometimes lead to numerical stability issues resulting in NaN losses.
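
As a tiny standalone illustration (plain PyTorch, unrelated to YOLOv5's code) of how FP16 arithmetic can produce inf and then NaN values that would be perfectly finite in FP32:

import torch

x = torch.tensor([300.0], dtype=torch.float16)  # float16 overflows above ~65504
print(x * x)          # tensor([inf], dtype=torch.float16): 300 * 300 = 90000 overflows
print(x * x - x * x)  # tensor([nan], dtype=torch.float16): inf - inf is NaN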

Your contribution will definitely help others who are facing a similar problem. Thank you for sharing your solution with the community!

If you have any further questions or need additional assistance, feel free to ask. Happy training!

aswin-roman commented 1 year ago

@glenn-jocher

Still facing this issue. The NaN losses and zero metrics start appearing around 200 epochs, but it's random.

Training setup: 4 GPUs, batch size 128. Training was going fine on this same setup earlier, but now it's faulty.

Tried dropping the batch size and also tried setting "amp = False" instead of "amp = check_amp(model)  # check AMP".

The losses are going NaN and the validation metrics are reported as zero:

From training log:


2023-08-02T15:00:42.3244865Z     49/1799      6.47G    0.03901   0.003032   0.003846         14        416:  34%|███▍      | 178/517 [00:37<01:13,  4.64it/s]
2023-08-02T15:00:42.5346462Z     49/1799      6.47G    0.03901   0.003032   0.003846         14        416:  35%|███▍      | 179/517 [00:37<01:12,  4.68it/s]
2023-08-02T15:00:42.5347565Z     49/1799      6.47G      0.039   0.003029   0.003832         12        416:  35%|███▍      | 179/517 [00:37<01:12,  4.68it/s]
2023-08-02T15:00:42.7440951Z     49/1799      6.47G      0.039   0.003029   0.003832         12        416:  35%|███▍      | 180/517 [00:37<01:11,  4.70it/s]
2023-08-02T15:00:42.7442694Z     49/1799      6.47G        nan        nan        nan         28        416:  35%|███▍      | 180/517 [00:37<01:11,  4.70it/s]
2023-08-02T15:00:42.9535333Z     49/1799      6.47G        nan        nan        nan         28        416:  35%|███▌      | 181/517 [00:37<01:11,  4.72it/s]
2023-08-02T15:00:42.9536616Z     49/1799      6.47G        nan        nan        nan         21        416:  35%|███▌      | 181/517 [00:38<01:11,  4.72it/s]
2023-08-02T15:00:43.1551855Z     49/1799      6.47G        nan        nan        nan         21        416:  35%|███▌      | 182/517 [00:38<01:10,  4.74it/s]
2023-08-02T15:00:43.1552872Z     49/1799      6.47G        nan        nan        nan         13        416:  35%|███▌      | 182/517 [00:38<01:10,  4.74it/s]
2023-08-02T15:00:43.3647308Z     49/1799      6.47G        nan        nan        nan         13        416:  35%|███▌      | 183/517 [00:38<01:09,  4.80it/s]
2023-08-02T15:00:43.3648373Z     49/1799      6.47G        nan        nan        nan         10        416:  35%|███▌      | 183/517 [00:38<01:09,  4.80it/s]

while validation:

2023-08-02T15:36:58.0300983Z                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 259/259 [00:26<00:00,  9.72it/s]
2023-08-02T15:36:58.0751024Z                    all      16545       7894          0          0          0          0

glenn-jocher commented 1 year ago

@aswin-roman sorry to hear that you're still facing the issue despite trying the suggested solutions. It seems that the loss values become NaN and the validation metrics are reported as zero after a certain number of epochs.

There could be a few reasons for this behavior. One possible cause could be numeric instability due to very small or large values in the training data. It's also worth checking the data preprocessing steps and ensuring that there are no NaN or infinite values in the input data.

Another potential reason could be an issue with the model architecture or configuration. You could try adjusting the learning rate, weight decay, or other hyperparameters to see if it helps stabilize the training process.

It would be helpful to have more information about your training setup and configuration, such as the model architecture, dataset, and training parameters. With more specific details, the community may be able to provide further assistance and guidance.

If possible, you can also try smaller batch sizes to see if it helps in avoiding the NaN loss issue.

Thank you for reporting the problem, and please provide more details if possible so that we can further investigate and assist you in resolving the issue.
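
As one concrete example of the data check suggested above, a rough, illustrative sketch that scans YOLO-format label files for non-finite or out-of-range values might look like this (the directory path is a placeholder for your own dataset):

import glob
import numpy as np

for path in glob.glob('datasets/mydata/labels/train/*.txt'):  # placeholder path
    rows = np.loadtxt(path, ndmin=2)  # each row: class x_center y_center width height
    if rows.size == 0:
        continue  # empty label file (background image)
    if not np.isfinite(rows).all():
        print(f'non-finite value in {path}')
    elif (rows[:, 1:] < 0).any() or (rows[:, 1:] > 1).any():
        print(f'coordinate outside [0, 1] in {path}')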

aswin-roman commented 1 year ago

@glenn-jocher thank you for the reply.

If it is in fact numeric instability due to very small or large values in the training data, then I wonder why it is only being triggered at a later point (after 200 epochs). I don't suspect this part, but I will check it out.

We haven't changed the model architecture or configs; everything is at the defaults.

Training setup: 4x NVIDIA GeForce RTX 2080 Ti (11019 MiB), batch size 128, multi-GPU training using torchrun
Model architecture: YOLOv5s
Dataset: image dataset with almost 1 lakh (~100,000) images, 3 object classes
Training parameters: batch_size 128 --epochs 1800 --model yolov5s.pt --task train, workers=16, exist_ok=True, cache='disk'

I also tried reducing the batch size to 96; that didn't help.
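
For context, the multi-GPU launch being described looks roughly like the following with this repository's train.py flags (the dataset name and device list are illustrative, and --batch-size is the total batch split across all GPUs):

torchrun --nproc_per_node 4 train.py --data dataset.yaml --weights yolov5s.pt --batch-size 128 --epochs 1800 --device 0,1,2,3 --workers 16 --cache disk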

aswin-roman commented 1 year ago

@glenn-jocher, I was able to solve the metrics dropping to zero by removing the imgsz parameter from training. imgsz=416 was passed as a parameter (our dataset consists of square 416x416 images). After removing imgsz=416, training proceeds to the planned end. The "nan" values are still seen at some points in the training log, though.

glenn-jocher commented 10 months ago

@aswin-roman I'm glad to hear that removing the imgsz parameter from the training has resolved the issue of the metrics dropping to zero. The fact that the training is proceeding till the planned end is definitely a positive sign.

Regarding the intermittent appearance of "nan" in the training log, it may be related to gradient exploding or vanishing, which can cause difficulties in optimization. One potential approach to mitigate this is to try different weight initialization strategies or adjust the learning rate schedule. Additionally, introducing gradient clipping may help stabilize the training process and prevent the occurrence of "nan" values.

Your detailed insights into the issue and the steps you've taken will greatly assist the community in addressing similar challenges. Thank you for sharing your findings and I hope these additional suggestions help in further stabilizing your training. If you have any more questions or encounter further issues, please don't hesitate to ask.
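
For anyone who wants to experiment with the gradient-clipping suggestion, here is a generic, self-contained PyTorch sketch (not a patch to this repository's train.py; the toy model and data are purely illustrative) of clipping inside an AMP training step; the gradients are unscaled before clipping so the max-norm threshold applies to the true gradient values:

import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
use_amp = (device == 'cuda')

model = nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 10, device=device)
y = torch.randn(8, 1, device=device)

with torch.cuda.amp.autocast(enabled=use_amp):
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()
scaler.unscale_(optimizer)                                          # undo the loss scaling first
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)   # cap the gradient norm
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()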

devashish-bhake commented 9 months ago

I completely agree with @glenn-jocher. In my opinion it might be an exploding-gradient problem, or at least it looks like one. If I were in your position, I would definitely try a larger model like YOLOv5m or YOLOv5l to see whether the gradients behave better. Sometimes when there is a small number of nodes to update, the model can update their values drastically, so with a larger number of nodes to update there is a chance the gradients will not explode. That is what I think, so do correct me if I am wrong about it.
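
If anyone wants to try that suggestion, switching to a larger YOLOv5 variant is only a matter of the weights argument; for example (the other arguments here are illustrative):

python train.py --data dataset.yaml --weights yolov5m.pt --img 640 --batch-size 64 --epochs 300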

glenn-jocher commented 9 months ago

That's a great suggestion, @devashish-bhake! Experimenting with a larger model like YOLOv5m or YOLOv5l could indeed help in diagnosing the issue and determining if the gradient behavior improves. Using a larger model might distribute the weight updates across a larger number of nodes, potentially mitigating the gradient exploding issue.

Your insightful approach and willingness to explore different strategies is commendable. We appreciate your active contribution to the community's shared knowledge and problem-solving process.

If you have any further observations or need additional assistance during your experimentation with different model sizes, feel free to share them here. Your input and insights are valuable to the community.