ultralytics / ultralytics

Ultralytics YOLO11 πŸš€
https://docs.ultralytics.com
GNU Affero General Public License v3.0

YOLOv8: Declining mAPs During GPU Training #5338

Closed cisco-silva closed 11 months ago

cisco-silva commented 1 year ago

Search before asking

YOLOv8 Component

Train

Bug

Hi! When I train the YOLOv8 model on my GPU, I have noticed that the mAP metrics consistently decrease (and the loss metrics increase) or become erratic instead of improving after each epoch. This is a perplexing issue, since running the same code on my CPU yields the expected results, with mAP increasing and loss decreasing as the number of epochs grows.

I am using the following configuration:
Ubuntu 20.04
Python 3.11.5
NVIDIA GTX 1650
NVIDIA Driver 525.125.06
CUDA 11.8
cuDNN 8.6
Torch 2.1.0+cu118
Ultralytics YOLOv8.0.196

Observation:

I have included below two images that illustrate the problem I am facing:

  1. Shows the model metrics after 54 epochs on the GPU. The run was set for 60 epochs but stopped early at epoch 54 because no improvement had been observed over the last 50 epochs, and the results from epoch 4 were saved as best.pt.
  2. Shows the model metrics after 60 epochs on the CPU.

1 - Results with GPU: [results_GPU plot]

2 - Results with CPU (same code except for device=cpu): [results_CPU plot]

I have already tried various troubleshooting steps, including checking my GPU drivers and verifying the CUDA and cuDNN versions, but the problem persists.

Environment

No response

Minimal Reproducible Example

from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # pretrained YOLOv8n weights
results = model.train(data='coco128.yaml', epochs=60, device=0)  # train on GPU 0
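
For the CPU comparison mentioned above, the only change is the device argument; a minimal sketch of that run, assuming the same dataset and epoch count:

from ultralytics import YOLO

model = YOLO('yolov8n.pt')  # same pretrained starting point
results = model.train(data='coco128.yaml', epochs=60, device='cpu')  # device='cpu' instead of device=0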

Additional

No response

Are you willing to submit a PR?

[EDIT]

@zshariff6506, thanks for your suggestion!

(I'm editing my initial post to reply to you because, for some reason, I'm unable to add new comments without closing the issue)

I tried running the code a few more times with epochs = 100, patience = 100, and even ran one with epochs = 200, patience = 200, but the results unfortunately remained similar. The mAP and loss curves still behave strangely; although there is a slight improvement over time, it falls significantly short of the results obtained when running the same code on the CPU.

Below are the results of running on the GPU with epochs = 200, patience = 200: [results plot]
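
For reference, the longer run described above would look roughly like this (a sketch assuming the same model and dataset as the minimal example; the exact command was not posted):

from ultralytics import YOLO

model = YOLO('yolov8n.pt')
# Longer run with a matching early-stopping window
results = model.train(data='coco128.yaml', epochs=200, patience=200, device=0)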

github-actions[bot] commented 1 year ago

πŸ‘‹ Hello @cisco-silva, thank you for your interest in YOLOv8 πŸš€! We recommend a visit to the YOLOv8 Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a πŸ› Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics
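
After installing, a quick sanity check that the package and a CUDA device are visible (a minimal sketch; `checks` is the environment-summary helper exported by recent ultralytics releases):

from ultralytics import checks

checks()  # prints the Ultralytics version plus Python, torch and CUDA/GPU info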

Environments

YOLOv8 may be run in any of our up-to-date verified environments (with all dependencies including CUDA/cuDNN, Python and PyTorch preinstalled).

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

zshar7 commented 1 year ago

@cisco-silva That is pretty weird that your GPU run is underperforming your CPU run. Have you tried this multiple times? I would run it about 5 more times to check. If not, I suspect the issue is that training on your GPU is adjusting the model's "settings" badly. We can observe that the statistics start to improve about 20 epochs in. Possibly your model got lucky on its first tries, and when it then adjusted coarsely to achieve better results, it became worse.

What I would recommend is to increase your patience from 50 to a higher value. Patience is the number of epochs with no improvement after which training stops. This can be achieved like this (100 is just an example):

from ultralytics import YOLO

model = YOLO('yolov8n.pt')
results = model.train(data='coco128.yaml', epochs=60, device=0, patience=100)  # allow 100 epochs without improvement before stopping

Or you can disable early stopping entirely by setting patience to zero.
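
For example (a sketch based on the snippet above; in the YOLOv8 trainer a patience of 0 is treated as no early-stopping limit):

from ultralytics import YOLO

model = YOLO('yolov8n.pt')
# patience=0 -> the trainer never stops early for lack of improvement
results = model.train(data='coco128.yaml', epochs=60, device=0, patience=0)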

I'm sorry you are experiencing these issues and hope that your model improves and makes the struggle all worth it. Feel free to ask any questions.

glenn-jocher commented 1 year ago

@zshariff6506 That's unusual behavior indeed; ideally, the GPU should perform better than the CPU during training. I recommend trying this multiple times to ensure it's not a one-time issue.

From the graphics you've posted, it appears that the model's performance starts to improve slightly around the 20th epoch. This could indicate that, at the beginning, the model's configuration is being adjusted poorly, which in turn affects the performance. But with time, as the learning adjustments get finer, we see some improvement.

One potential solution would be to increase the patience parameter's value. The patience parameter determines the number of epochs with no improvement after which training will be stopped. By increasing this value, you allow your model more training epochs, providing additional opportunity for improvement. Here, I recommend using a larger patience value, like 100.

Alternatively, you can turn off the patience parameter by setting its value to zero. This could help to see if the model's performance improves with more epochs.

I apologize for the inconvenience you've experienced and hope these suggestions will help to improve your model's performance. Let us know if you have any other questions!

glenn-jocher commented 1 year ago

@zshariff6506 I'm not able to reproduce any CPU-GPU differences using your commands. Here is what I see:

CPU: [results plot]

GPU: [results plot]

zshar7 commented 1 year ago

@glenn-jocher @cisco-silva seems to have edited his opening message for this issue.

Honestly @cisco-silva, I don't know what is causing this issue at first glance; however, I'll investigate further and let you know!

zshar7 commented 1 year ago

After looking at it and asking one of my friends who works with PCs, I believe it is one of two things:

  1. A bug in ultralytics.
  2. Your GPU (GTX 1650) might not be configured correctly.

My friend and I don't know a lot about the GTX 1650 and its configuration. Try uninstalling and reinstalling ultralytics... Since you use Python 3.11.5, try pip3.11 instead.

glenn-jocher commented 1 year ago

@zshariff6506 it's indeed strange to see your CPU outperforming your GPU during model training. Let's tackle both of your concerns:

  1. Considering it's a bug in Ultralytics: While it would be natural to attribute unexpected behavior to a potential error in the library, we maintain high-quality controls and frequent testing to minimize such issues. Rest assured, we'll investigate this further.

  2. About GPU configuration: A correctly installed and configured GPU usually outperforms a CPU at tasks such as training deep learning models. In your case, it seems like there may be some misconfiguration in your GPU setup. The GTX 1650 should technically perform well with Ultralytics.

Based on the information you've provided, I would suggest ensuring your GPU is properly set up and configured. Also, ensure that PyTorch, the core library used by Ultralytics, is correctly recognizing and using the GPU.
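
As a quick way to confirm that PyTorch recognizes the GPU (a minimal sketch using standard torch calls; the printed values depend on your setup):

import torch

print(torch.__version__, torch.version.cuda)   # e.g. 2.1.0+cu118 11.8
print(torch.cuda.is_available())               # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))       # e.g. NVIDIA GeForce GTX 1650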

In terms of your Python version, Ultralytics is compatible with your current version, so there shouldn't be a problem there. For your pip version, pip3 is suitable for any Python 3.x installation. The exact version number isn't necessary; pip3 should cleanly install Ultralytics.

Let's continue the investigation until we find the root cause. Keep me updated with your progress!

github-actions[bot] commented 11 months ago

πŸ‘‹ Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please visit the Ultralytics Docs at https://docs.ultralytics.com.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO πŸš€ and Vision AI ⭐