ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Memory leak from large dataset? #12293

Closed. siddagra closed this issue 9 months ago

siddagra commented 10 months ago

Search before asking

Question

I am training on an extremely large dataset of about 3 million images. Training starts off fine, using about 8-10 GB of memory with a batch size of 16 and 12 workers, but as it progresses through the images during the first epoch, memory usage increases steadily. Eventually it uses all of my memory and iteration slows dramatically, dropping from 15 iters/sec to 4-6 iters/sec.

I am running this on Ubuntu 20.04 with 16 GB of RAM, an RTX 3080 GPU (10 GB VRAM), and a Ryzen 3600 CPU.

I suspect the OS may also be caching these files in RAM; I often find that once a chunk of files has been accessed, accessing them again is much faster.

Is there any way to fix this? I tried taking a screenshot, but the system gets so laggy that the UI itself freezes.

Additional

No response

github-actions[bot] commented 10 months ago

👋 Hello @siddagra, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.8.0 with all requirements.txt dependencies installed, including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of our up-to-date verified environments, with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled.

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics
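
As a minimal usage sketch (the model weights and dataset names below are the standard examples from the Ultralytics docs; substitute your own):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")                  # load a pretrained YOLOv8 nano model
model.train(data="coco128.yaml", epochs=3)  # fine-tune on the small COCO128 example dataset
results = model("path/to/image.jpg")        # run inference on a single image
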
glenn-jocher commented 10 months ago

@siddagra training on a dataset with 3 million images can indeed lead to increased memory usage, especially if you have a large batch size and multiple workers. The memory usage tends to increase as the training progresses through the images.

To mitigate this issue, you can try a few options:

  1. Decrease the batch size: Reduce the batch size to a smaller value, such as 8 or 4. This will reduce the memory consumption during training, at the cost of potentially slower convergence.

  2. Reduce the number of workers: Decrease the number of workers to a lower value, such as 4 or 2. This will reduce the amount of memory used by the data loading process (an example combining both settings is sketched after this list).

  3. Use a larger machine: If possible, consider using a machine with more RAM or VRAM. This will provide more resources for training on large datasets.
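
For instance, both settings can be passed together; this is a sketch assuming the run() helper exposed by recent YOLOv5 releases' train.py (keyword names mirror the --batch-size and --workers CLI flags; the dataset yaml path is a placeholder):

import train  # yolov5/train.py; run this from inside the cloned yolov5/ directory

train.run(
    data="path/to/dataset.yaml",  # placeholder dataset definition
    weights="yolov5s.pt",
    imgsz=640,
    batch_size=8,  # option 1: smaller batch to cut memory use
    workers=4,     # option 2: fewer dataloader workers
)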

Regarding your suspicion about the OS caching files in RAM, it is possible that the operating system is caching the dataset files in memory, which can improve the access speed for subsequent iterations. However, this shouldn't cause the memory usage to increase continuously.

If the memory usage keeps increasing even after trying these suggestions, it may be worth investigating if there are any data loading or preprocessing steps in your training pipeline that are causing memory leaks. Double-check your code to ensure that you are releasing any unnecessary memory in each iteration.

Please let me know if you have any further questions or if there's anything else I can assist you with.

siddagra commented 10 months ago

Thank you for the prompt response!

These are some screenshots I took during my last training session (attached images: 20231029_201244, 20231029_201406, 20231026_225250, 20231029_201238).

I tried once again with a new session after reducing num_workers to 6; the batch size is already low at 16 (for YOLOv5s). While reducing workers and batch size does reduce memory overhead, memory usage still steadily increases throughout the epoch until it goes OOM. It resets after the epoch finishes (as tested on a smaller dataset). I have not changed anything in the YOLOv5 code in terms of pre- or post-processing.

Only 1.5 to 2 GB of VRAM is being used.

I also checked, and it is indeed not the OS page cache taking up that much memory; I verified that by monitoring the cache.
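
For anyone checking the same thing, process memory can be separated from the Linux page cache by comparing the training process's RSS with the Cached field in /proc/meminfo; a small sketch, assuming psutil is installed:

import psutil

def memory_report():
    # Memory owned by this process (what a leak would grow) vs. reclaimable OS page cache.
    # Pass the trainer's PID to psutil.Process(pid) if running this outside the training script.
    rss_gb = psutil.Process().memory_info().rss / 1e9
    with open("/proc/meminfo") as f:  # Linux only
        cached_kb = next(int(line.split()[1]) for line in f if line.startswith("Cached:"))
    print(f"process RSS: {rss_gb:.2f} GB | page cache: {cached_kb / 1e6:.2f} GB")

memory_report()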

Any tips on what could be done are much appreciated. Thank you.

siddagra commented 10 months ago

I also tried a dummy test with a plain PyTorch DataLoader and found that it does not leak memory, unlike YOLOv5's dataloading code.
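
A minimal sketch of that kind of dummy test, with synthetic data and psutil assumed for reading RSS:

import psutil
import torch
from torch.utils.data import Dataset, DataLoader

class DummyDataset(Dataset):
    """Returns a random image-sized tensor; nothing is cached between items."""
    def __len__(self):
        return 3_000_000

    def __getitem__(self, i):
        return torch.rand(3, 640, 640)

loader = DataLoader(DummyDataset(), batch_size=16, num_workers=6)
for step, _ in enumerate(loader):
    if step % 1000 == 0:  # RSS should stay flat across the epoch if nothing leaks
        print(f"step {step}: RSS {psutil.Process().memory_info().rss / 1e9:.2f} GB")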

siddagra commented 10 months ago

Is there any other way to train, in either YOLOv5 or YOLOv8, that can process an iterable stream of data? I am willing to modify code if needed, if you can point me in the right direction of where this memory leak may be occurring.
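
For context, plain PyTorch can stream samples lazily with an IterableDataset instead of holding a full index in memory, though YOLOv5's trainer does not consume one out of the box, so wiring it in means modifying the dataloader code. A rough sketch (the index file name is a placeholder):

import torch
from torch.utils.data import IterableDataset, DataLoader

class StreamingDataset(IterableDataset):
    """Yields samples lazily from an index file instead of keeping every path in a Python list."""
    def __init__(self, index_path):
        self.index_path = index_path  # text file with one image path per line

    def __iter__(self):
        with open(self.index_path) as f:
            for line in f:
                path = line.strip()
                # Real code would decode the image and its labels here;
                # a dummy tensor keeps the sketch self-contained.
                yield torch.zeros(3, 640, 640), path

loader = DataLoader(StreamingDataset("train_index.txt"), batch_size=16, num_workers=4)

Note that with num_workers > 0 each worker replays the whole stream unless the iterator shards itself using torch.utils.data.get_worker_info().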

glenn-jocher commented 10 months ago

Hi @siddagra, thank you for reaching out.

Currently, in both YOLOv5 and YOLOv8, the training process builds an in-memory index of the entire dataset (image paths and labels) before training starts. This indexing step can be memory-intensive for very large datasets. However, once the index is built, memory usage should stabilize during training.

If you're experiencing memory leaks specifically during training with YOLOv5, it's possible that there may be some issues in the code related to data loading or handling. It's difficult to pinpoint the exact cause without looking at your specific code.

To investigate further, you can try analyzing the data loading pipeline in YOLOv5, specifically utils/dataloaders.py (datasets.py in older releases), to check for any unnecessary memory allocations or leaks. You may also consider profiling the code to identify any bottlenecks or inefficiencies.
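
One lightweight way to profile this is Python's built-in tracemalloc, sampling snapshots a few hundred batches apart and diffing them to see which lines keep allocating; a rough sketch (it only tracks Python-level allocations, so native buffers such as OpenCV images will not show up):

import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# ... run a few hundred training iterations here ...

current = tracemalloc.take_snapshot()
for stat in current.compare_to(baseline, "lineno")[:10]:  # ten biggest growers, by source line
    print(stat)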

Keep in mind that modifying the YOLOv5 code is at your own risk, as it may impact the functionality and compatibility with future updates. If you can provide a minimal reproducible example or further details about your setup and code, the community and the Ultralytics team may be able to provide more specific guidance to address the memory leak issue.

Feel free to ask any further questions or provide additional information, and we'll do our best to assist you.

siddagra commented 10 months ago

After about 8 hours of debugging, I have narrowed the issue down to the Mosaic augmentation. Everything else is working fine and stable. The moment I disable Mosaic augmentation, the memory leak completely stops and memory is stable at 11.2 GB without using any swap.

The moment I enable Mosaic augmentation, memory consumption increases every few iterations. It is not a sudden spike that then drops back to normal; rather, it slowly accumulates until CPU RAM goes OOM.

I tried fixing it by explicitly del-ing all variables that are not returned by the collate_fn4, load_mosaic and load_mosaic9 functions. It did not fix the issue. I will likely just run training without Mosaic.
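
For reference, Mosaic is switched on and off via the mosaic value in the hyperparameter yaml; a small sketch, assuming the default hyp file path used by recent YOLOv5 releases and the run() helper in train.py:

import yaml
import train  # yolov5/train.py; run from inside the cloned yolov5/ directory

# Copy the default hyperparameters with Mosaic disabled, then train with the copy.
with open("data/hyps/hyp.scratch-low.yaml") as f:
    hyp = yaml.safe_load(f)
hyp["mosaic"] = 0.0  # probability of applying Mosaic; 0.0 disables it

with open("hyp.no-mosaic.yaml", "w") as f:
    yaml.safe_dump(hyp, f)

train.run(data="path/to/dataset.yaml", weights="yolov5s.pt", hyp="hyp.no-mosaic.yaml")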

Thanks a lot for your support.

glenn-jocher commented 10 months ago

@siddagra thank you for bringing this issue to our attention and providing detailed information about the memory leak you encountered during training with Mosaic Augmentation. We appreciate your efforts in troubleshooting and narrowing down the problem.

Based on your findings, it appears that the Mosaic Augmentation is causing the memory leak issue, as disabling it stops the memory consumption from increasing. Additionally, you mentioned that explicitly deleting variables that are not returned by collate_fn4, load_mosaic, and load_mosaic9 functions did not resolve the problem.

We acknowledge your decision to proceed with training without using Mosaic Augmentation to ensure stable memory usage. It's worth noting that Mosaic Augmentation can be memory-intensive due to the need for composing and manipulating multiple images.

We will take note of this issue and investigate further to identify the root cause of the memory leak associated with Mosaic Augmentation. This will help us address the problem and provide a more reliable and efficient training experience in the future.

Thank you again for your feedback and assistance in improving YOLOv5. Please don't hesitate to reach out if you have any further questions or concerns.

siddagra commented 10 months ago

You are welcome, and thank you for your support as well. I still see some memory leak in YOLOv5 with this approach (though considerably less), and no memory leak in YOLOv7 with the same approach, so I will likely recode the dataloader to not store all file paths and see if that helps. There also seems to be an article about this: https://docs.aws.amazon.com/codeguru/detector-library/python/pytorch-data-loader-with-multiple-workers/
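
The issue that article describes is the well-known one where a Dataset holding large Python lists or dicts of per-item metadata (e.g. millions of path strings) gradually un-shares the copy-on-write memory of fork-based DataLoader workers, because every read touches Python reference counts; the usual mitigation is to keep such metadata in numpy arrays. A rough sketch of the idea (not YOLOv5's actual dataloader):

import numpy as np
from torch.utils.data import Dataset

class PathDataset(Dataset):
    """Keeps image paths in a fixed-width numpy byte array instead of a Python list of str."""
    def __init__(self, paths):
        # One contiguous buffer, with no per-element Python objects for workers to touch.
        self.paths = np.array(paths, dtype=np.bytes_)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        path = self.paths[i].decode()  # decode only the entry being read
        # ... load and preprocess the image at `path` here ...
        return path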

If that doesn't work, I guess I will use YOLOv4 Darknet (C++) or YOLOv7 (though the latter trains slowly).

glenn-jocher commented 10 months ago

@siddagra thank you for your feedback and for sharing your findings regarding the memory leak issue with YOLOv5 when using Mosaic augmentation. We appreciate your efforts in troubleshooting and narrowing down the problem. It's interesting that you observed a considerably reduced memory leak with this approach in YOLOv5 and no memory leak in YOLOv7.

Considering your next steps, recoding the dataloader to not store all file paths seems like a potential solution worth exploring. Additionally, the article you mentioned, which provides insights on PyTorch dataloader with multiple workers, could also offer helpful information for addressing the memory leak issue.

If these steps don't resolve the problem, YOLOv4 Darknet (C++) or YOLOv7 could be alternative options, keeping in mind that YOLOv7 may train more slowly.

Thank you for bringing this issue to our attention, and for your patience and support while we work on addressing the memory leak. If you have any further questions or need additional assistance, please don't hesitate to reach out.

github-actions[bot] commented 9 months ago

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐