ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0
51.22k stars · 16.44k forks

Training very very slowly #1241

Closed mengban closed 3 years ago

mengban commented 4 years ago

❔Question

Training is very, very slow, and GPU utilization is always 0 in nvidia-smi, although GPU memory usage is about 20 GB+. Is this normal?

Additional context

Here is my env: yolov5 version: 83deec, Python: 3.8, CUDA: 10.1, cuDNN: 7.6.3, PyTorch: 1.6.0, GPU: Tesla V100 (32 GB). I am training yolov5m with 20k+ images, and GPU utilization is always 0.

github-actions[bot] commented 4 years ago

Hello @mengban, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook Open In Colab, Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.

For more information please visit https://www.ultralytics.com.

glenn-jocher commented 4 years ago

@mengban GPU utilisation should be about 90% when running nvidia-smi. You may have environment problems. I would recommend the Docker Image as an easy way to reproduce our environment while exploiting your hardware.

Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again. We also highly recommend using one of our verified environments below.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

mengban commented 4 years ago


Thanks for your reply. I reinstalled the packages with pip install -r requirements.txt, and my problem still exists. I also find that the 8 dataloader workers keep the CPU at nearly 100%, so I think it may be caused by my dataset: the images are about 3000 × 4000 pixels, some even 6000 * 4000, and a single image can contain 100+ boxes. So I suspect the CPU can't feed data to the GPU in time, which slows the whole training process. What do you think?
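One way to act on that hypothesis is to downscale the oversized images once, offline, so the dataloader no longer decodes and resizes 6000 × 4000 JPEGs every epoch (YOLOv5 trains at 640 px by default, so even a 1280 px longest side leaves headroom). A minimal sketch; `target_size`, `downscale_dir`, and the 1280 px cap are illustrative choices, not part of the repo:

```python
def target_size(width, height, max_side=1280):
    """Return (w, h) scaled so the longest side is at most max_side,
    preserving aspect ratio; already-small images are returned unchanged."""
    scale = max_side / max(width, height)
    if scale >= 1:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))

def downscale_dir(img_dir, max_side=1280):
    """Rewrite every .jpg under img_dir so its longest side is <= max_side.
    Requires Pillow; run once on a COPY of the dataset before training."""
    from pathlib import Path
    from PIL import Image  # third-party: pip install pillow
    for path in Path(img_dir).glob("**/*.jpg"):
        im = Image.open(path)
        w, h = target_size(*im.size, max_side=max_side)
        if (w, h) != im.size:
            im.resize((w, h), Image.BILINEAR).save(path)
```

If memory allows, the repo's `--cache-images` training flag is another way to pay the decode cost only once, at the price of RAM.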

glenn-jocher commented 4 years ago

@mengban both CPU and GPU utilization should be 90-100%. 8 --workers is the default; you are free to vary it as you see fit.

As I said, try the Docker image.
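The arithmetic behind the dataloader-bottleneck hypothesis above can be sketched directly: the GPU idles whenever the per-image CPU preprocessing time, divided across the workers, exceeds the GPU's per-batch step time. All numbers below are illustrative assumptions, not measurements:

```python
def loader_bound(cpu_ms_per_image, batch_size, workers, gpu_ms_per_batch):
    """Rough check of whether the dataloader can keep the GPU fed.
    Returns True when CPU preprocessing is the bottleneck."""
    cpu_ms_per_batch = cpu_ms_per_image * batch_size / workers
    return cpu_ms_per_batch > gpu_ms_per_batch

# Example (assumed numbers): decoding and augmenting a 6000x4000 JPEG
# might take ~300 ms. With batch 16 and 8 workers, the CPU then needs
# 600 ms per batch; if the GPU finishes a batch in ~150 ms, it idles
# 75% of the time and nvidia-smi shows near-zero utilization.
print(loader_bound(300, 16, 8, 150))   # CPU-bound
print(loader_bound(5, 16, 8, 150))     # GPU stays busy
```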

dongjuns commented 4 years ago

Docker usage link, https://docs.ultralytics.com/yolov5/environments/docker_image_quickstart_tutorial/

sudo docker run --ipc=host --gpus all -it -v "$(pwd)"/yourDirectory:/usr/src/yourDirectory ultralytics/yolov5:latest

Replace 'yourDirectory' with the directory you want to use inside the YOLOv5 Docker container.

mengban commented 4 years ago


Thanks, bro. I'll give it a try.

dongjuns commented 4 years ago

+1. In the Docker container, the yolov5 directory is placed at /usr/src/app.

siyangxie commented 4 years ago

So where do you see your GPU-Util? I don't see it when training.

dongjuns commented 4 years ago

@SiyangXie Use these commands in the terminal:

nvidia-smi
watch nvidia-smi
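For scripted monitoring, the same counters nvidia-smi prints can be queried in machine-readable CSV form and parsed. A sketch; `parse_gpu_stats` and `poll_gpu` are illustrative helper names, and the sample line mimics the symptom in this issue (0% utilization, ~20 GB allocated):

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse 'utilization.gpu, memory.used' CSV lines from nvidia-smi
    into a list of (util_percent, mem_mib) tuples, one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append((int(util), int(mem)))
    return stats

def poll_gpu():
    """Run nvidia-smi once and return parsed stats (needs an NVIDIA GPU)."""
    return parse_gpu_stats(subprocess.check_output(QUERY, text=True))

# A healthy training run should show utilization near 90-100%;
# this issue's symptom would parse as [(0, 20875)]:
print(parse_gpu_stats("0, 20875\n"))
```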

glenn-jocher commented 4 years ago

@SiyangXie @dongjuns yes the nvidia-smi command is the best way to monitor GPU stats.

A new option for monitoring GPU utilization is also W&B logging, which plots your utilization, temperature, and CUDA memory over your full training run. Here are stats for a COCO128 YOLOv5x training with a V100 on Colab Pro. We are putting together tutorials this week for our recent W&B integration.

[Screenshot (2020-11-02): W&B plots of GPU utilization, temperature, and memory over the training run]

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

LegendSun0 commented 2 years ago


I have the same problem. Have you solved it?
