ultralytics / ultralytics

NEW - YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

YOLOv8 gets slower and slower over time when training with multi-GPU #3869

Closed Solairseir closed 9 months ago

Solairseir commented 11 months ago

Search before asking

YOLOv8 Component

No response

Bug

When I train YOLOv8 with two identical GPUs, training gets slower and slower over time. It starts out well, with 100% load on both GPUs, then somewhere around epoch 20 or 30 it slows down until it barely puts any load on the first GPU.

I used this setup (same GPUs, same CUDA, same PyTorch version) to train with mmsegmentation and mmdetection and did not have this issue.

Even with YOLOv8, the training sometimes finishes without any problem.

I checked for any other program using the CPU or GPU; nothing else is running on this system. Maybe the problem is with the dataloader, because it never utilizes all the CPU cores to 100%.

The PC has a Samsung 980 SSD, so disk speed should not be the issue here.

My only conclusion is that, for some reason, the dataloader messes up feeding data to the two GPUs, or something like that.

By the way, this PC has three GPUs; only two of them are used for training, and the third is the one the monitor is connected to and that X.Org runs on.

Environment

Minimal Reproducible Example

I'm using CLI command for training.

yolo detect train data=dataset/yolo.yaml model=yolov8n.pt device=0,1 epochs=100 batch=12

Additional

No response

Are you willing to submit a PR?

github-actions[bot] commented 11 months ago

👋 Hello @Solairseir, thank you for your interest in YOLOv8 🚀! We recommend a visit to the YOLOv8 Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.7 environment with PyTorch>=1.7.

pip install ultralytics

Environments

YOLOv8 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

Ultralytics CI

If this badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 11 months ago

@Solairseir thank you for your detailed report. It seems like you're experiencing a slowdown in multi-GPU training over time. This is an unusual behavior and could potentially be tied to several factors.

As you've mentioned, the data loading efficiency can impact training speed significantly. In YOLOv8, the DataLoader is designed to pre-fetch data for the next batch while the current batch is being processed. If the DataLoader is slower than the training process, then eventually it may not be able to keep up, leading to the GPUs waiting for the data.
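If CPU-side loading does turn out to be the bottleneck, one knob worth experimenting with is the number of dataloader worker processes, for example (a sketch only; the worker count is illustrative and should be tuned to your CPU core count):

from ultralytics import YOLO

# Illustrative only: give the dataloader more worker processes so pre-fetching
# can keep two GPUs fed; adjust 'workers' to your CPU core count.
model = YOLO("yolov8n.pt")
model.train(data="dataset/yolo.yaml", device=[0, 1], epochs=100, batch=12, workers=8)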

It's also worth taking note of how tasks are allocated across the GPUs: a GPU that is also driving the X.Org display can slow down due to its rendering workload.

However, isolating the specific cause would require further investigation. It could be a good idea to monitor GPU utilization, memory usage, and CPU usage over time as the training progresses. We would appreciate it if you can provide this information.
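For example, a minimal logging sketch you could leave running alongside training (assuming nvidia-smi is on the PATH and the psutil package is installed; the interval and filename are arbitrary):

import csv
import subprocess
import time

import psutil

# Append one row of GPU / CPU / RAM stats every 30 s while training runs in
# another process; stop with Ctrl+C when the run finishes.
with open("train_usage_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        gpu_stats = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        ).stdout.strip().replace("\n", " | ")
        writer.writerow([int(time.time()), gpu_stats,
                         psutil.cpu_percent(),
                         psutil.virtual_memory().percent])
        f.flush()
        time.sleep(30)

Plotting these columns against epoch timestamps should make it clear whether the slowdown coincides with falling GPU utilization, rising RAM usage, or CPU saturation.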

An alternate test to isolate the issue would be to try training on a single GPU and compare the results. If there is no slowdown observed, it could suggest an issue with multi-GPU coordination.
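For the single-GPU comparison, something along these lines via the Python API (a sketch with the same settings, restricted to one device):

from ultralytics import YOLO

# Same configuration as the multi-GPU run, but on a single device,
# to check whether the slowdown still appears without DDP in the picture.
model = YOLO("yolov8n.pt")
model.train(data="dataset/yolo.yaml", device=0, epochs=100, batch=12)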

Lastly, YOLOv8's ability to efficiently utilize multiple GPUs depends on many factors such as the batch size, dataset size, and model size. In some circumstances, you may see diminishing returns or even negative scaling. However, this scenario usually applies to more than two GPUs and in your case, it should scale up well given that your setup works well with other frameworks. Therefore, this is likely not a factor but still worth considering.

Your efforts in identifying this issue are greatly appreciated. This helps us improve the functionality of YOLOv8, which ultimately benefits the whole community.

github-actions[bot] commented 9 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

siddagra commented 2 months ago

Something similar happens to me on a single GPU. The dataloader keeps leaking memory, and over time this accumulates in your page file or swap partition. Eventually it uses up most of your memory and is constantly swapping pages between disk and RAM, grinding training to a halt.

It may or may not be a memory leak in your case. Usually, if it is a memory issue, it shows up within the first few epochs.

If it is a memory issue, either fix it yourself by using np.arrays or torch tensors (instead of Python lists, as in the current implementation), use YOLO-NAS, which uses np.arrays, or get more RAM. Unfortunately, memory still leaks a little, and for large enough datasets you may still run into issues.
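As a rough sketch of the list-to-array idea (not the actual Ultralytics dataset code; the label layout here is just an example): packing per-image labels into one contiguous array keeps the number of Python objects low, which also helps with the copy-on-write memory growth seen in forked dataloader workers.

import numpy as np

# Hypothetical example: labels for one image as a list of [class, x, y, w, h] rows.
labels_as_lists = [[0, 0.51, 0.42, 0.10, 0.20],
                   [3, 0.25, 0.66, 0.08, 0.12]]

# One contiguous float32 buffer instead of many small Python list objects.
labels = np.asarray(labels_as_lists, dtype=np.float32)
print(labels.shape, labels.nbytes)  # (2, 5) 40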

glenn-jocher commented 2 months ago

Hi there! 👋 It sounds like you're encountering some memory management challenges during training. Memory leaks, especially in the dataloader, can indeed affect training efficiency and performance over time. If the memory is persistently escalating and causing swapping, that would explain the slowdown.

A quick tip would be to ensure that your data handling (both loading and processing) is as efficient as possible. Transitioning from lists to np.arrays or torch.Tensors could potentially alleviate some of the memory pressure by utilizing memory more effectively.

Here's a quick example of ensuring your data is in torch.Tensor:

import torch

# Assuming 'data' is your input list of numeric values
data = [0.1, 0.2, 0.3, 0.4]  # example placeholder

# Convert once to a single contiguous tensor instead of keeping a Python list
tensor_data = torch.tensor(data, dtype=torch.float32)

Also, running YOLO-NAS, which is designed with efficiency in mind, is a good suggestion, although it might not completely solve the problem if there's an inherent memory leak in the framework or your code.

Getting more RAM could provide a temporary fix, but it might be worthwhile to closely monitor your memory usage during the early epochs to pinpoint when and where the leak intensifies. Tools like memory_profiler in Python can help you monitor memory usage in real-time.
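For instance, a small sketch with memory_profiler (install with pip install memory_profiler; load_batches here is just a hypothetical stand-in for your own data-loading step):

from memory_profiler import profile

@profile  # prints line-by-line memory usage each time the function runs
def load_batches(n):
    batches = []
    for _ in range(n):
        batches.append([0.0] * 100_000)  # stand-in for loading/augmenting a batch
    return batches

if __name__ == "__main__":
    load_batches(50)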

If the problem persists, it might help to share more details or even a minimal code snippet that reproduces the issue on the YOLOv8 GitHub issues page. Collecting more insights this way could lead to a more targeted solution.

Hang in there and keep experimenting! 🔍