ultralytics / ultralytics

NEW - YOLOv8 πŸš€ in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Yolov8 nano halts during training, after "checks passed βœ…" #10060

Closed: tjasmin111 closed this issue 2 days ago

tjasmin111 commented 1 month ago

Search before asking

Question

I'm trying to train a YOLOv8-cls Nano classifier on a dataset of 350K images. I don't know why training halts and doesn't proceed after the checks pass. I'm not sure whether this is an issue with the machine, or whether Nano can't handle that many images / the memory requirements. Any advice on this?

train: /home/dataset/v1/train... found 312245 images in 15 classes βœ…
val: /home/dataset/v1/val... found 35670 images in 15 classes βœ…
test: None...
Overriding model.yaml nc=1000 with nc=13
                   from  n    params  module                                       arguments
  0                  -1  1       464  ultralytics.nn.modules.conv.Conv             [3, 16, 3, 2]
  1                  -1  1      4672  ultralytics.nn.modules.conv.Conv             [16, 32, 3, 2]
  2                  -1  1      7360  ultralytics.nn.modules.block.C2f             [32, 32, 1, True]
  3                  -1  1     18560  ultralytics.nn.modules.conv.Conv             [32, 64, 3, 2]
  4                  -1  2     49664  ultralytics.nn.modules.block.C2f             [64, 64, 2, True]
  5                  -1  1     73984  ultralytics.nn.modules.conv.Conv             [64, 128, 3, 2]
  6                  -1  2    197632  ultralytics.nn.modules.block.C2f             [128, 128, 2, True]
  7                  -1  1    295424  ultralytics.nn.modules.conv.Conv             [128, 256, 3, 2]
  8                  -1  1    460288  ultralytics.nn.modules.block.C2f             [256, 256, 1, True]
  9                  -1  1    346893  ultralytics.nn.modules.head.Classify         [256, 13]
YOLOv8n-cls summary: 99 layers, 1454941 parameters, 1454941 gradients, 3.4 GFLOPs
Transferred 156/158 items from pretrained weights
TensorBoard: Start with 'tensorboard --logdir runs/classify/train2', view at http://localhost:6006/
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed βœ…

Additional

No response

glenn-jocher commented 1 month ago

Hey there!

It seems like your training session is hitting a snag after the checks pass. Given the large dataset size, the halt could be related to memory constraints. YOLOv8 Nano is designed to be lightweight and efficient, but the sheer volume of your dataset might indeed push your system's limits. Here are a couple of suggestions that might help:

- If the issue persists, share the logs from right before the halt occurs for a deeper dive.
- Run the training with a smaller subset of your dataset; if that succeeds, the dataset size is likely the trigger.
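To try the smaller-subset suggestion without touching the full dataset, here is a minimal stdlib-only sketch (the helper name and `per_class` default are assumptions, not part of the ultralytics API) that copies a random sample from each class folder of an image-classification dataset:

```python
# Hypothetical helper: copy up to `per_class` images from each class folder
# in `src` into `dst`, preserving the classify-dataset folder layout.
# Assumes the ImageNet-style structure: src/<class_name>/<image files>.
import random
import shutil
from pathlib import Path

def make_subset(src: str, dst: str, per_class: int = 500, seed: int = 0) -> int:
    """Build a smaller copy of a classification dataset; return images copied."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    copied = 0
    for class_dir in sorted(Path(src).iterdir()):
        if not class_dir.is_dir():
            continue
        out_dir = Path(dst) / class_dir.name
        out_dir.mkdir(parents=True, exist_ok=True)
        images = sorted(p for p in class_dir.iterdir() if p.is_file())
        for img in rng.sample(images, min(per_class, len(images))):
            shutil.copy2(img, out_dir / img.name)
            copied += 1
    return copied
```

You could then point the trainer at the subset directory instead of the full dataset and watch whether the run still stalls after the AMP checks.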

Happy to assist further if needed! 😊

tjasmin111 commented 1 month ago

But it worked for the Small model before. And I'm already passing imgsz=320. Also, my system's RAM and GPU are large enough and shouldn't be the problem.

Where are the logs stored? How can I get them?

glenn-jocher commented 1 month ago

Hi there! 🌟

Great to hear that it worked with the Small model and that you're already using imgsz=320. If your system's resources are sufficient, let's look into the logs for more clues.

Logs for YOLOv8 classification training sessions, including any errors or warnings, are stored in the runs/classify/train* directories (your log above points to runs/classify/train2), with TensorBoard event files in runs/classify/train*/events.out.tfevents.*.

To access TensorBoard logs and visualize your training progress, you can run:

tensorboard --logdir runs/classify

and then open http://localhost:6006/ in your browser.
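If you'd rather locate the newest run directory and its event files programmatically, here is a small stdlib-only sketch (the function names are assumptions; it relies only on the runs/classify/train* layout shown in the training log above):

```python
# Minimal sketch: find the most recently modified train* run directory
# under a runs root, and list the TensorBoard event files it contains.
from pathlib import Path
from typing import Optional

def latest_run(root: str = "runs/classify") -> Optional[Path]:
    """Return the most recently modified train* directory, or None if absent."""
    runs = [d for d in Path(root).glob("train*") if d.is_dir()]
    return max(runs, key=lambda d: d.stat().st_mtime, default=None)

def event_files(run_dir: Path) -> list:
    """TensorBoard event files written during the run."""
    return sorted(run_dir.glob("events.out.tfevents.*"))
```

Printing `latest_run()` before and after a stalled run can confirm which directory's event files to inspect or upload here.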

If you encounter any specific errors in those logs, feel free to share here for further assistance! πŸš€

github-actions[bot] commented 2 weeks ago

πŸ‘‹ Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the Ultralytics documentation at https://docs.ultralytics.com.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO πŸš€ and Vision AI ⭐