ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

The reason for NaN #12591

Closed KwangryeolPark closed 10 months ago

KwangryeolPark commented 10 months ago

YOLOv5 Component

Training

Bug

Like the reporters of other issues, I also see NaN values while training yolov5m on the COCO dataset, following the scripts in coco.yaml and README.md.

I tried to find the cause of the NaN values and found a hint in an issue that is indirectly about AMP (Automatic Mixed Precision).

It makes sense that lower precision is more likely to produce NaN during casting because of underflow.

Therefore, I think many of the NaN problems come from AMP, so it looks better to use NVIDIA apex, which uses distribution shift to prevent distribution mismatch.
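The underflow concern above can be illustrated with a minimal sketch that uses only the Python standard library. It rounds a value to IEEE 754 half precision via `struct`, shows a tiny value flushing to zero, and shows that multiplying by a scale factor before the cast preserves it. The gradient value `1e-8` and the scale `65536.0` are illustrative assumptions, not values taken from this training run, and the sketch shows underflow only, not how a NaN subsequently arises.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8  # illustrative tiny gradient, below the smallest fp16 subnormal
print(to_fp16(grad))  # the value underflows to zero when cast to fp16

scale = 65536.0  # illustrative loss scale (2**16), an assumption for this sketch
scaled = to_fp16(grad * scale)  # the scaled value is representable in fp16
recovered = scaled / scale  # unscale in higher precision
print(scaled, recovered)
```

Scaling before the cast and unscaling afterwards in higher precision is the basic idea behind loss scaling in mixed precision training.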

Environment

YOLOv5m
torch: 1.12.1+cu116
python: 3.8.12
dataset: coco
optimizer: CAME
epochs: 300
batch size: 40

Minimal Reproducible Example

python train.py --data coco.yaml --epochs 300 --weights '' --cfg yolov5m.yaml --batch-size 40 --optimizer CAME --device 0

I use the CAME optimizer with betas=(momentum, 0.999, 0.999).

Additional

No response

github-actions[bot] commented 10 months ago

👋 Hello @KwangryeolPark, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics

KwangryeolPark commented 10 months ago

I hope you fix the mixed precision problem.

glenn-jocher commented 10 months ago

@KwangryeolPark hello! Thanks for bringing this to our attention. NaNs during training can indeed sometimes be related to precision issues when using mixed precision training (AMP). However, there could be other factors at play, such as learning rate, weight initialization, or data preprocessing.

Regarding the use of NVIDIA apex, YOLOv5 uses PyTorch's native AMP implementation, which is generally recommended for its ease of use and integration. If you're experiencing NaNs with AMP, you might want to try the following:

  1. Reduce the learning rate.
  2. Increase the batch size if possible, as smaller batches can sometimes lead to instability with AMP.
  3. Ensure that your data preprocessing is correct and that there are no anomalies in the dataset.
  4. Experiment with different optimizers if the issue persists.
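For context on how native AMP reacts to non-finite gradients, here is a simplified pure-Python sketch of dynamic loss scaling. It is a toy model, not PyTorch's implementation: the class name `DynamicLossScale` and its `step` method are hypothetical. The default values are chosen to mirror the documented defaults of `torch.cuda.amp.GradScaler` (`init_scale`, `growth_factor`, `backoff_factor`, `growth_interval`).

```python
import math


class DynamicLossScale:
    """Toy model of dynamic loss scaling (hypothetical, not PyTorch's code)."""

    def __init__(self, init_scale=65536.0, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps with finite gradients

    def step(self, scaled_grads):
        """Return unscaled gradients, or None if the step should be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # inf/NaN found: shrink the scale and skip the optimizer update
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return None
        unscaled = [g / self.scale for g in scaled_grads]
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            # enough clean steps in a row: try a larger scale again
            self.scale *= self.growth_factor
            self._good_steps = 0
        return unscaled


scaler = DynamicLossScale(growth_interval=2)
print(scaler.step([float("inf"), 1.0]))  # non-finite gradient: step is skipped
print(scaler.scale)                      # scale has been halved
print(scaler.step([32768.0]))            # finite gradient: unscaled and returned
```

In this model a non-finite gradient costs one skipped update rather than corrupting the weights. NaN values that persist in the loss itself therefore point toward causes outside gradient scaling, such as the learning rate or data issues listed above.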

If you're willing to submit a PR, we'd be happy to review any improvements or fixes you propose. Just make sure to thoroughly test your changes to ensure they're beneficial across various scenarios.

Remember to check out our documentation for more details on troubleshooting and best practices: https://docs.ultralytics.com/yolov5/

Thanks for your contribution to the YOLOv5 community! 🚀

KwangryeolPark commented 10 months ago

@glenn-jocher Thank you for your answer.

To set the learning rate, I looked at Training Arguments and found the lr0 argument. However, when I add --lr0 0.001, the script fails with train.py: error: unrecognized arguments: --lr0 1e-3.

glenn-jocher commented 10 months ago

Apologies for the confusion, @KwangryeolPark. The correct argument for setting the initial learning rate in the YOLOv5 training script is --lr. So, if you want to set the initial learning rate to 0.001, you should use the following command:

python train.py --data coco.yaml --epochs 300 --weights '' --cfg yolov5m.yaml --batch-size 40 --optimizer CAME --device 0 --lr 0.001

Make sure to adjust the learning rate according to your specific needs and keep an eye on the training process to ensure stability. If you have any further questions or issues, don't hesitate to reach out. Happy training! 🚀

KwangryeolPark commented 10 months ago

@glenn-jocher Thank you for the guidance. However, the --lr 0.001 argument also produces an error: train.py: error: unrecognized arguments: --lr 0.001

glenn-jocher commented 10 months ago

I apologize for the oversight, @KwangryeolPark. In YOLOv5, the learning rate is set in the hyperparameter configuration file rather than as a command-line argument. You can adjust the learning rate by editing the hyp.scratch.yaml file or any other hyperparameter file you are using.

For example, to set the initial learning rate to 0.001, you would modify the lr0 value in your hyperparameter file like so:

lr0: 0.001  # initial learning rate

Then, you can reference this hyperparameter file during training using the --hyp argument:

python train.py --data coco.yaml --epochs 300 --weights '' --cfg yolov5m.yaml --batch-size 40 --optimizer CAME --device 0 --hyp your_hyperparameter_file.yaml

Replace your_hyperparameter_file.yaml with the path to your edited hyperparameter file. This should correctly set the initial learning rate for your training session. If you encounter any further issues, please let us know. Good luck with your training! 🌟
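As one way to script the edit described above, here is a minimal sketch that rewrites the `lr0` entry in the text of a hyperparameter file using only the standard library. The helper name `override_lr0` is hypothetical, the sample text is an illustrative snippet rather than the contents of any shipped file, and the sketch assumes `lr0` appears exactly once as a top-level `key: value` line.

```python
import re


def override_lr0(yaml_text: str, new_lr0: float) -> str:
    """Return yaml_text with the value of the top-level `lr0:` key replaced."""
    pattern = re.compile(r"^(lr0:\s*)\S+", flags=re.MULTILINE)
    # A function replacement avoids ambiguity between group references and digits
    updated, count = pattern.subn(lambda m: f"{m.group(1)}{new_lr0}", yaml_text)
    if count != 1:
        raise ValueError(f"expected exactly one lr0 entry, found {count}")
    return updated


# Illustrative snippet, not the contents of a real hyperparameter file
example = "lr0: 0.01  # initial learning rate\nmomentum: 0.937\n"
print(override_lr0(example, 0.001))
```

The returned text can be written to a new file, which is then passed to train.py with --hyp. Working on a copy leaves the original hyperparameter file untouched.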

KwangryeolPark commented 10 months ago

Thank you

glenn-jocher commented 10 months ago

You're welcome, @KwangryeolPark! If you have any more questions or need further assistance in the future, feel free to reach out. Best of luck with your YOLOv5 training! Happy detecting! 🚀👀