ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Can not allocate gpu memory #704

Closed Michaelzeyong closed 4 years ago

Michaelzeyong commented 4 years ago

❔Question

When I run python train.py --data ./data/coco.yaml --cfg yolov5s.yaml --weights yolov5s.pt --batch-size 64, an error occurs at model = Model(opt.cfg, nc=nc).to(device):

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 137438953472 bytes. Error code 12.

I checked the code and found that it attempts to create an nn.Conv2d with a kernel size of 512*512.

I added a print to the following code:

import torch.nn as nn

class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Conv, self).__init__()
        print('kernel', c1, c2, k, s, g)  # debug print: ch_in, ch_out, kernel, stride, groups
        # autopad is defined alongside Conv in models/common.py
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.LeakyReLU(0.1, inplace=True) if act else nn.Identity()

The print result is as follows:

     from  n      params  module                        arguments
kernel 12 32 32 1 1
  0    -1  1      393280  models.common.Focus           [3, 32, 32]
kernel 32 64 64 1 1
  1    -1  1     8388736  models.common.Conv            [32, 64, 64]
kernel 64 32 1 1 1
kernel 64 64 1 1 1
kernel 32 32 1 1 1
kernel 32 32 3 1 1
  2    -1  1       19904  models.common.BottleneckCSP   [64, 64, 1, 64]
kernel 64 128 128 1 1
  3    -1  1   134217984  models.common.Conv            [64, 128, 128]
kernel 128 64 1 1 1
kernel 128 128 1 1 1
kernel 64 64 1 1 1
kernel 64 64 3 1 1
kernel 64 64 1 1 1
kernel 64 64 3 1 1
kernel 64 64 1 1 1
kernel 64 64 3 1 1
  4    -1  1      161152  models.common.BottleneckCSP   [128, 128, 3, 128]
kernel 128 256 256 1 1
  5    -1  1  2147484160  models.common.Conv            [128, 256, 256]
kernel 256 128 1 1 1
kernel 256 256 1 1 1
kernel 128 128 1 1 1
kernel 128 128 3 1 1
kernel 128 128 1 1 1
kernel 128 128 3 1 1
kernel 128 128 1 1 1
kernel 128 128 3 1 1
  6    -1  1      641792  models.common.BottleneckCSP   [256, 256, 3, 256]
kernel 256 512 512 1 1

Why is the kernel size 32 or 512? That is far too large. I also found that it differs from the pretrained model yolov5s.pt.
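The numbers in the log can be checked by hand. The sketch below is a minimal illustration, assuming float32 weights (4 bytes each), a bias-free nn.Conv2d with groups=1, and a BatchNorm2d that contributes one weight and one bias per output channel. The helper names conv_weight_count and conv_block_params are hypothetical and not part of YOLOv5. Under those assumptions the per-layer counts reproduce the params column, and the final Conv2d(256, 512, 512) asks for exactly the 137438953472 bytes named in the RuntimeError.

```python
# Parameter and memory arithmetic for the Conv block printed in the log.
# Assumptions: float32 weights, bias-free Conv2d, BatchNorm2d with weight + bias.

BYTES_PER_FLOAT32 = 4


def conv_weight_count(c1, c2, k, g=1):
    """Number of weights in a bias-free Conv2d(c1, c2, k, groups=g)."""
    return (c1 // g) * c2 * k * k


def conv_block_params(c1, c2, k, g=1):
    """Conv2d weights plus BatchNorm2d weight and bias (2 * c2)."""
    return conv_weight_count(c1, c2, k, g) + 2 * c2


# Layers 0, 1, 3 and 5 of the log: (ch_in, ch_out, kernel).
for c1, c2, k in [(12, 32, 32), (32, 64, 64), (64, 128, 128), (128, 256, 256)]:
    print(c1, c2, k, conv_block_params(c1, c2, k))

# The layer being built when the allocation fails: Conv2d(256, 512, 512).
failing_bytes = conv_weight_count(256, 512, 512) * BYTES_PER_FLOAT32
print(failing_bytes)          # 137438953472
print(failing_bytes / 2**30)  # 128.0 (GiB)

# For comparison only: the same channels with a 3x3 kernel.
small_bytes = conv_weight_count(256, 512, 3) * BYTES_PER_FLOAT32
print(small_bytes)            # 4718592 (4.5 MiB)
```

Because this memory is requested while the model is constructed, before any batch is loaded, the arithmetic does not involve the batch size at all. The 3x3 figure is only a point of comparison, not a statement about what yolov5s.yaml specifies for this layer.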

Additional context

github-actions[bot] commented 4 years ago

Hello @Michaelzeyong, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook, Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients.

For more information please visit https://www.ultralytics.com.

Ownmarc commented 4 years ago

Lower the batch size

Michaelzeyong commented 4 years ago

> Lower the batch size

The error occurred in model = Model(opt.cfg, nc=nc).to(device), so the batch size is not the cause. The code attempts to use a very large conv kernel.

glenn-jocher commented 4 years ago

@Michaelzeyong it appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.8 environment, clone the latest repo (code changes daily), and pip install -r requirements.txt again. We also highly recommend using one of our verified environments below.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -r requirements.txt
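As a rough way to confirm the two version requirements stated above (Python 3.8 or later, torch>=1.6), something like the following could be run inside the environment. It is only a sketch: the function names are hypothetical, it reads the installed torch version from package metadata rather than importing torch, and it does not check the other entries in requirements.txt.

```python
import re
import sys

try:
    from importlib import metadata  # standard library from Python 3.8
except ImportError:  # older interpreter; reported as a problem below
    metadata = None

MIN_PYTHON = (3, 8)
MIN_TORCH = (1, 6)


def parse_version(text):
    """Return (major, minor) from a version string such as '1.6.0+cu101'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("unrecognised version string: %r" % text)
    return int(match.group(1)), int(match.group(2))


def installed_torch_version():
    """Return the installed torch version string, or None if unavailable."""
    if metadata is None:
        return None
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Collect human-readable problems; an empty list means both checks pass."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python %d.%d found, 3.8 or later required" % sys.version_info[:2])
    torch_version = installed_torch_version()
    if torch_version is None:
        problems.append("torch not found in this environment")
    elif parse_version(torch_version) < MIN_TORCH:
        problems.append("torch %s found, 1.6 or later required" % torch_version)
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["Python and torch versions look OK"]:
        print(line)
```

An empty result only says the interpreter and torch versions are recent enough; it does not rule out a stale or modified checkout of the repository.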

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.